Wilson Mar

Recognizes text and special characters in image files (after preparation with ImageMagick), for 60+ languages (using LSTM machine learning). Used by Selenium.

Overview

The word “Tesseract” was adopted as the name of the OCR (Optical Character Recognition) engine program because it is able to recognize multiple-directional 3D lines.

The Tesseract shown in the Marvel Cinematic Universe is a (3-dimensional) physical cube. But the object has a 4th dimension of time, thus enabling time travel in the MCU and in Madeleine L’Engle’s novel/movie “A Wrinkle in Time”.

VIDEO: But a tesseract in science (real life) is a conceptual four-dimensional cube with a 4th “w” axis, whose projection into three dimensions is shown as a shadow.

Usage

  1. I wrote a shell script that runs OCR on the most recently created file in the ~/Desktop folder and opens the output file in VSCode (using the code command):

    ./ocr.sh

    Optionally, specify a file name:

    ./ocr.sh "Screen Shot 2020-05-10 at 3.18.06 PM.png"

    Afterward, the image file is deleted.
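The ocr.sh script itself is not listed here, so the following is only a minimal Python sketch of the same idea, under stated assumptions: the function names `newest_image` and `ocr_command` are hypothetical, "last file created" is approximated by the latest modification time, and the list of image suffixes is a guess at what a screenshot folder contains. Unlike the script, the sketch neither deletes the image nor opens an editor.

```python
from pathlib import Path

# Assumed set of suffixes worth sending to Tesseract (not taken from ocr.sh).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".gif", ".bmp"}


def newest_image(folder):
    """Return the most recently modified image file in folder, or None.

    Modification time stands in for creation time, which is an assumption.
    """
    candidates = [
        path
        for path in Path(folder).expanduser().iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    ]
    return max(candidates, key=lambda path: path.stat().st_mtime, default=None)


def ocr_command(image_path, output_base="out"):
    """Build the Tesseract CLI call; the text lands in <output_base>.txt."""
    return ["tesseract", str(image_path), output_base]
```

To run it, pass the result to `subprocess.run(ocr_command(newest_image("~/Desktop")), check=True)` after checking that `newest_image` did not return `None`.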

Installation

Tesseract 4 is available from the package repositories of Ubuntu 18.04+.

  1. For various operating systems, install a pre-built executable binary listed at https://github.com/tesseract-ocr/tesseract/wiki.

    On macOS:

    brew install tesseract --HEAD
    pip install pytesseract
  2. Verify the version:

    tesseract -v
    
    tesseract 4.1.1
     leptonica-1.79.0
      libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
     Found AVX2
     Found AVX
     Found FMA
     Found SSE
    

    The Leptonica dependency (http://www.leptonica.org) provides utilities for image processing and image analysis.

    Use Tesseract

    
    IN_FILE="tesseract-quick-brown-fox.png"
    tesseract "${IN_FILE}"  out
    
  3. PROTIP: Navigate to the folder where other image files are captured, usually:

    cd ~/Desktop
    

    Sample file

  4. Download from GitHub the sample image file (above) that we will turn into text:

    wget https://raw.githubusercontent.com/wilsonmar/DevSecOps/master/Tesseract/tesseract-quick-brown-fox.png
    
  5. Run Tesseract from that folder (the sample .png can also be .tiff, .jpg, .gif, .bmp, etc.)

    IN_FILE="tesseract-quick-brown-fox.png"
    tesseract "${IN_FILE}"  out
    

    Response:

    Tesseract Open Source OCR Engine v4.1.0 with Leptonica

    By default, Tesseract produces plain-text output, uses the English language, and uses Page Segmentation Mode 3. The parameters are listed by this command:

    tesseract --help-extra
    Usage:
      tesseract --help | --help-extra | --help-psm | --help-oem | --version
      tesseract --list-langs [--tessdata-dir PATH]
      tesseract --print-parameters [options...] [configfile...]
      tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]
     
    OCR options:
      --tessdata-dir PATH   Specify the location of tessdata path.
      --user-words PATH     Specify the location of user words file.
      --user-patterns PATH  Specify the location of user patterns file.
      --dpi VALUE           Specify DPI for input image.
      -l LANG[+LANG]        Specify language(s) used for OCR.
      -c VAR=VALUE          Set value for config variables.
                         Multiple -c arguments are allowed.
      --psm NUM             Specify page segmentation mode.
      --oem NUM             Specify OCR Engine mode.
    NOTE: These options must occur before any configfile.
     
    Page segmentation modes:
      0    Orientation and script detection (OSD) only.
      1    Automatic page segmentation with OSD.
      2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
      3    Fully automatic page segmentation, but no OSD. (Default)
      4    Assume a single column of text of variable sizes.
      5    Assume a single uniform block of vertically aligned text.
      6    Assume a single uniform block of text.
      7    Treat the image as a single text line.
      8    Treat the image as a single word.
      9    Treat the image as a single word in a circle.
     10    Treat the image as a single character.
     11    Sparse text. Find as much text as possible in no particular order.
     12    Sparse text with OSD.
     13    Raw line. Treat the image as a single text line,
        bypassing hacks that are Tesseract-specific.
     
    OCR Engine modes:
      0    Legacy engine only.
      1    Neural nets LSTM engine only.
      2    Legacy + LSTM engines.
      3    Default, based on what is available.
     
    Single options:
      -h, --help            Show minimal help message.
      --help-extra          Show extra help for advanced users.
      --help-psm            Show page segmentation modes.
      --help-oem            Show OCR Engine modes.
      -v, --version         Show version information.
      --list-langs          List available languages for tesseract engine.
      --print-parameters    Print tesseract parameters.
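
The `--psm` and `--oem` options in the help output above can be passed from a program as well. This is a small sketch, not part of Tesseract: the helper name `build_tesseract_args` is hypothetical, and the valid ranges (0 to 13 for page segmentation, 0 to 3 for engine mode) are copied from the help text shown here, so they may differ in other Tesseract versions.

```python
# Valid mode numbers, taken from the `tesseract --help-extra` output above.
PSM_MODES = range(0, 14)  # page segmentation modes 0..13
OEM_MODES = range(0, 4)   # OCR engine modes 0..3


def build_tesseract_args(image, output_base="out", lang="eng", psm=3, oem=3):
    """Return an argument list for the tesseract CLI.

    The defaults mirror the documented defaults: English, PSM 3, OEM 3.
    """
    if psm not in PSM_MODES:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in OEM_MODES:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return [
        "tesseract", str(image), output_base,
        "-l", lang,
        "--psm", str(psm),
        "--oem", str(oem),
    ]
```

For example, `build_tesseract_args("line.png", psm=7)` asks Tesseract to treat the image as a single text line; the list can be handed to `subprocess.run`.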
    
  6. Use a text editor to view the contents of output file out.txt created by Tesseract based on the sample image file:

    The (quick) [brown] {fox} jumps!
    Over the $43,456.78 <lazy> #90 dog
    & duck/goose, as 12.5% of E-mail
    from aspammer@website.com is spam.
    Der ,.schnelle” braune Fuchs springt
    iiber den faulen Hund. Le renard brun
    «rapide» saute par-dessus le chien
    paresseux. La volpe marrone rapida
    salta sopra il cane pigro. El zorro
    marron rapido salta sobre el perro
    perezoso. A raposa marrom rapida
    salta sobre o céo preguicoso.
    

    Even though the image is slightly tilted, Tesseract should recognize all the various special characters such as curly braces, angle brackets, !, $, #, %, slash, and @ signs, etc. However:

    • the opening low quotation mark („) in front of “schnelle” is misrecognized as “,.”,
    • the tilde in “cão” is wrongly recognized as an acute accent (“céo”),
    • the cedilla in “preguiçoso” is not recognized (“preguicoso”).

    Language recognition

    It did not recognize European language accents such as the umlaut in “über” (output as “iiber”). “marron rapido” is supposed to be accented (“marrón rápido”). “preguicoso”, a Portuguese word meaning lazy, is missing the diacritical tail of the c-cedilla (cedilha in Portuguese).

    But Tesseract is now supposed to recognize characters from over 100 languages. Tesseract originated at HP. In 2016, @theRaySmith at Google said that Tesseract includes an LSTM (Long Short-Term Memory) machine-learning algorithm with deep belief networks.

    NOTE: LSTM is a form of RNN (Recurrent Neural Network) that recognizes a sequence of characters rather than single characters (which are better handled by a CNN, or Convolutional Neural Network).

  7. To get Tesseract to recognize the full set of language characters, run with additional parameters specifying more language codes from the wiki site:

    tesseract  tesseract-quick-brown-fox.png  out  -l eng+deu+fra+ita+spa+por
    

    The sequence of language codes matters: deu = Deutsch (German), fra = French, ita = Italian, spa = Spanish, por = Portuguese.

    Error messages are expected if additional configuration was not done:

    Error opening data file /usr/local/Cellar/tesseract/4.1.0/share/tessdata/deu.traineddata
    Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
    Failed loading language 'deu'
    Tesseract Open Source OCR Engine v4.1.0 with Leptonica

  8. List the default languages available:

    tesseract --list-langs
    

    In the response, “osd” stands for Orientation and Script Detection, according to the wiki site:

    List of available languages (3):
    eng
    osd
    snum
  9. So we install language files:

    brew install tesseract-lang

    It’s a large file installed from https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

  10. Identify location of language files:

    brew list tesseract-lang

    Expected response:

    /usr/local/Cellar/tesseract-lang/4.0.0/share/tessdata/ (161 files)

    TODO: The version is a bit behind? The “4.0.0” in the path means it needs to be manually redone when a newer version is available.

  11. Set the environment variable that defines the path where Tesseract looks for language files:

    This path overrides the default of “/usr/local/share/tesseract-ocr/”.

    export TESSDATA_PREFIX="/usr/local/Cellar/tesseract/4.1.0/share/tessdata/"
    ls $TESSDATA_PREFIX
    
  12. The tesseract-lang folder created does not contain the default languages, so copy its language files into the default tessdata folder:

    cp -a /usr/local/Cellar/tesseract-lang/4.0.0/share/tessdata/. /usr/local/Cellar/tesseract/4.1.0/share/tessdata/
    ls -al /usr/local/Cellar/tesseract/4.1.0/share/tessdata/
    
  13. To reclaim disk space:

    brew remove tesseract-lang
    Uninstalling /usr/local/Cellar/tesseract-lang/4.0.0... (163 files, 651.8MB)
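
Before re-running a multi-language command, it can help to confirm that every requested code has a data file, since a missing one produces the "Failed loading language" error shown earlier. The sketch below assumes the layout seen in the error message (one `<code>.traineddata` file per language directly inside the tessdata folder); the function name `missing_languages` is hypothetical.

```python
from pathlib import Path


def missing_languages(lang_spec, tessdata_dir):
    """Return the codes in a spec such as 'eng+deu+fra' that have no
    <code>.traineddata file directly inside tessdata_dir."""
    folder = Path(tessdata_dir)
    return [
        code
        for code in lang_spec.split("+")
        if not (folder / f"{code}.traineddata").is_file()
    ]
```

A call such as `missing_languages("eng+deu+fra+ita+spa+por", os.environ["TESSDATA_PREFIX"])` (with `import os`) should return an empty list once the copy in step 12 has succeeded.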

    Tesserocr for Python

    To see the now much longer list of languages, this time use a Python program that calls the tesserocr wrapper for Python.

  14. Install using pip:

    pip install tesserocr
    pip install Pillow
    

    The response at time of writing:

    Collecting tesserocr
      Downloading https://files.pythonhosted.org/packages/e3/77/fb26b321c3b9ce4a47af12b19e85ddbf4d0629adb6552d85276e824e6e51/tesserocr-2.5.0.tar.gz (54kB)
      |████████████████████████████████| 61kB 156kB/s 
    Building wheels for collected packages: tesserocr
      Building wheel for tesserocr (setup.py) ... done
      Created wheel for tesserocr: filename=tesserocr-2.5.0-cp37-cp37m-macosx_10_14_x86_64.whl size=169578 sha256=0ea0c430f6649a974c43805e9d8662fe0f14ef85f305e3fba6ded924cf4eb1a5
      Stored in directory: /Users/wilson_mar/Library/Caches/pip/wheels/c0/32/13/70d610c079b65b21a5fb84af4fe7593cbf06da35f69cf10209
    Successfully built tesserocr
    Installing collected packages: tesserocr
    Successfully installed tesserocr-2.5.0
    
  15. Run the Python program:

    python3 GetAvailableLanguages.py
    

    The response is a Python list of language codes:

    [‘afr’, ‘amh’, ‘ara’, ‘asm’, ‘aze’, ‘aze_cyrl’, ‘bel’, ‘ben’, ‘bod’, ‘bos’, ‘bre’, ‘bul’, ‘cat’, ‘ceb’, ‘ces’, ‘chi_sim’, ‘chi_sim_vert’, ‘chi_tra’, ‘chi_tra_vert’, ‘chr’, ‘cos’, ‘cym’, ‘dan’, ‘deu’, ‘div’, ‘dzo’, ‘ell’, ‘eng’, ‘enm’, ‘epo’, ‘est’, ‘eus’, ‘fao’, ‘fas’, ‘fil’, ‘fin’, ‘fra’, ‘frk’, ‘frm’, ‘fry’, ‘gla’, ‘gle’, ‘glg’, ‘grc’, ‘guj’, ‘hat’, ‘heb’, ‘hin’, ‘hrv’, ‘hun’, ‘hye’, ‘iku’, ‘ind’, ‘isl’, ‘ita’, ‘ita_old’, ‘jav’, ‘jpn’, ‘jpn_vert’, ‘kan’, ‘kat’, ‘kat_old’, ‘kaz’, ‘khm’, ‘kir’, ‘kmr’, ‘kor’, ‘kor_vert’, ‘lao’, ‘lat’, ‘lav’, ‘lit’, ‘ltz’, ‘mal’, ‘mar’, ‘mkd’, ‘mlt’, ‘mon’, ‘mri’, ‘msa’, ‘mya’, ‘nep’, ‘nld’, ‘nor’, ‘oci’, ‘ori’, ‘osd’, ‘pan’, ‘pol’, ‘por’, ‘pus’, ‘que’, ‘ron’, ‘rus’, ‘san’, ‘script/Arabic’, ‘script/Armenian’, ‘script/Bengali’, ‘script/Canadian_Aboriginal’, ‘script/Cherokee’, ‘script/Cyrillic’, ‘script/Devanagari’, ‘script/Ethiopic’, ‘script/Fraktur’, ‘script/Georgian’, ‘script/Greek’, ‘script/Gujarati’, ‘script/Gurmukhi’, ‘script/HanS’, ‘script/HanS_vert’, ‘script/HanT’, ‘script/HanT_vert’, ‘script/Hangul’, ‘script/Hangul_vert’, ‘script/Hebrew’, ‘script/Japanese’, ‘script/Japanese_vert’, ‘script/Kannada’, ‘script/Khmer’, ‘script/Lao’, ‘script/Latin’, ‘script/Malayalam’, ‘script/Myanmar’, ‘script/Oriya’, ‘script/Sinhala’, ‘script/Syriac’, ‘script/Tamil’, ‘script/Telugu’, ‘script/Thaana’, ‘script/Thai’, ‘script/Tibetan’, ‘script/Vietnamese’, ‘sin’, ‘slk’, ‘slv’, ‘snd’, ‘snum’, ‘spa’, ‘spa_old’, ‘sqi’, ‘srp’, ‘srp_latn’, ‘sun’, ‘swa’, ‘swe’, ‘syr’, ‘tam’, ‘tat’, ‘tel’, ‘tessconfigs/afr’, ‘tessconfigs/amh’, ‘tessconfigs/ara’, ‘tessconfigs/asm’, ‘tessconfigs/aze’, ‘tessconfigs/aze_cyrl’, ‘tessconfigs/bel’, ‘tessconfigs/ben’, ‘tessconfigs/bod’, ‘tessconfigs/bos’, ‘tessconfigs/bre’, ‘tessconfigs/bul’, ‘tessconfigs/cat’, ‘tessconfigs/ceb’, ‘tessconfigs/ces’, ‘tessconfigs/chi_sim’, ‘tessconfigs/chi_sim_vert’, ‘tessconfigs/chi_tra’, ‘tessconfigs/chi_tra_vert’, ‘tessconfigs/chr’, ‘tessconfigs/cos’, ‘tessconfigs/cym’, ‘tessconfigs/dan’, 
‘tessconfigs/deu’, ‘tessconfigs/div’, ‘tessconfigs/dzo’, ‘tessconfigs/ell’, ‘tessconfigs/eng’, ‘tessconfigs/enm’, ‘tessconfigs/epo’, ‘tessconfigs/est’, ‘tessconfigs/eus’, ‘tessconfigs/fao’, ‘tessconfigs/fas’, ‘tessconfigs/fil’, ‘tessconfigs/fin’, ‘tessconfigs/fra’, ‘tessconfigs/frk’, ‘tessconfigs/frm’, ‘tessconfigs/fry’, ‘tessconfigs/gla’, ‘tessconfigs/gle’, ‘tessconfigs/glg’, ‘tessconfigs/grc’, ‘tessconfigs/guj’, ‘tessconfigs/hat’, ‘tessconfigs/heb’, ‘tessconfigs/hin’, ‘tessconfigs/hrv’, ‘tessconfigs/hun’, ‘tessconfigs/hye’, ‘tessconfigs/iku’, ‘tessconfigs/ind’, ‘tessconfigs/isl’, ‘tessconfigs/ita’, ‘tessconfigs/ita_old’, ‘tessconfigs/jav’, ‘tessconfigs/jpn’, ‘tessconfigs/jpn_vert’, ‘tessconfigs/kan’, ‘tessconfigs/kat’, ‘tessconfigs/kat_old’, ‘tessconfigs/kaz’, ‘tessconfigs/khm’, ‘tessconfigs/kir’, ‘tessconfigs/kmr’, ‘tessconfigs/kor’, ‘tessconfigs/kor_vert’, ‘tessconfigs/lao’, ‘tessconfigs/lat’, ‘tessconfigs/lav’, ‘tessconfigs/lit’, ‘tessconfigs/ltz’, ‘tessconfigs/mal’, ‘tessconfigs/mar’, ‘tessconfigs/mkd’, ‘tessconfigs/mlt’, ‘tessconfigs/mon’, ‘tessconfigs/mri’, ‘tessconfigs/msa’, ‘tessconfigs/mya’, ‘tessconfigs/nep’, ‘tessconfigs/nld’, ‘tessconfigs/nor’, ‘tessconfigs/oci’, ‘tessconfigs/ori’, ‘tessconfigs/osd’, ‘tessconfigs/pan’, ‘tessconfigs/pol’, ‘tessconfigs/por’, ‘tessconfigs/pus’, ‘tessconfigs/que’, ‘tessconfigs/ron’, ‘tessconfigs/rus’, ‘tessconfigs/san’, ‘tessconfigs/sin’, ‘tessconfigs/slk’, ‘tessconfigs/slv’, ‘tessconfigs/snd’, ‘tessconfigs/snum’, ‘tessconfigs/spa’, ‘tessconfigs/spa_old’, ‘tessconfigs/sqi’, ‘tessconfigs/srp’, ‘tessconfigs/srp_latn’, ‘tessconfigs/sun’, ‘tessconfigs/swa’, ‘tessconfigs/swe’, ‘tessconfigs/syr’, ‘tessconfigs/tam’, ‘tessconfigs/tat’, ‘tessconfigs/tel’, ‘tessconfigs/tgk’, ‘tessconfigs/tha’, ‘tessconfigs/tir’, ‘tessconfigs/ton’, ‘tessconfigs/tur’, ‘tessconfigs/uig’, ‘tessconfigs/ukr’, ‘tessconfigs/urd’, ‘tessconfigs/uzb’, ‘tessconfigs/uzb_cyrl’, ‘tessconfigs/vie’, ‘tessconfigs/yid’, ‘tessconfigs/yor’, ‘tgk’, ‘tha’, ‘tir’, ‘ton’, 
‘tur’, ‘uig’, ‘ukr’, ‘urd’, ‘uzb’, ‘uzb_cyrl’, ‘vie’, ‘yid’, ‘yor’]
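
GetAvailableLanguages.py is not listed in this article, so the following is only a guess at what such a program might contain. It relies on `tesserocr.get_languages()`, which returns the tessdata path together with the list of codes. The `group_languages` helper is an addition for illustration: it separates plain language codes from the `script/...` models and from other sub-folder entries such as the `tessconfigs/...` items in the list above.

```python
def group_languages(codes):
    """Split codes into plain languages, script models, and other
    sub-folder entries (for example 'tessconfigs/eng')."""
    groups = {"languages": [], "scripts": [], "other": []}
    for code in codes:
        if "/" not in code:
            groups["languages"].append(code)
        elif code.startswith("script/"):
            groups["scripts"].append(code)
        else:
            groups["other"].append(code)
    return groups


def main():
    try:
        import tesserocr  # third-party wrapper installed in step 14
    except ImportError:
        print("tesserocr is not installed")
        return
    # get_languages() returns (tessdata path, list of language codes).
    tessdata_path, codes = tesserocr.get_languages()
    print(tessdata_path)
    for name, members in group_languages(codes).items():
        print(f"{name}: {len(members)}")


if __name__ == "__main__":
    main()
```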

    Run to see accents

  16. Run Tesseract again with the language codes, then open the out.txt file again. You should now see accented characters:

    Der „schnelle” braune Fuchs springt
    über den faulen Hund. Le renard brun
    «rapide» saute par-dessus le chien
    paresseux. La volpe marrone rapida
    salta sopra il cane pigro. EI zorro
    marrön räpido salta sobre el perro
    perezoso. A raposa marrom räpida
    salta sobre o cão preguiçoso.
    
  17. To output to a pdf file instead of txt file, add “pdf” to the end of the command.

    Asian characters

    When specifying the language code for Chinese, note there are chi_sim (Simplified) and chi_tra (Traditional) for horizontal text, and chi_sim_vert and chi_tra_vert for vertical text.

    Japanese and Korean language codes behave the same way.

Image Preparation

If you need to convert images, use the popular open-source ImageMagick (https://imagemagick.org).

  1. Install using Homebrew (instead of downloading, gunzipping, setting variables, etc.):

    brew install imagemagick
  2. Because ImageMagick depends on Ghostscript fonts, install them as well:

    brew install ghostscript
  3. To convert a file (such as a PDF) into a high-resolution image, use ImageMagick’s convert command:

    convert -density 300 test.pdf -depth 8 -strip -background white -alpha off out.tiff
    

    This also takes off Alpha channels and outputs to a TIFF format file.

    An alternative parameter is “-monochrome”, which converts the image to black-and-white.

    The last parameter is the output file.
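
When many files need converting, the same convert call can be assembled in Python. This is a sketch only: `pdf_to_tiff_command` is a hypothetical helper, and it reproduces exactly the flags from the command above, with "-monochrome" as an optional extra placed before the output file.

```python
def pdf_to_tiff_command(pdf_path, tiff_path, density=300, monochrome=False):
    """Build the ImageMagick convert call used above as an argument list."""
    command = [
        "convert",
        "-density", str(density),  # render the input at this DPI
        str(pdf_path),
        "-depth", "8",
        "-strip",
        "-background", "white",
        "-alpha", "off",           # drop the alpha channel
    ]
    if monochrome:
        command.append("-monochrome")
    command.append(str(tiff_path))  # the last parameter is the output file
    return command
```

Running it is then `subprocess.run(pdf_to_tiff_command("test.pdf", "out.tiff"), check=True)`, after which out.tiff can be passed to Tesseract.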

Usage within Python

Within venv

https://github.com/sirfz/tesserocr

  1. Install Pillow, a module for image processing in Python:

    pip install Pillow

  • Code for Python:

    https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d offers this snippet:

    from PIL import Image  # Pillow is the maintained fork of PIL and keeps the PIL import name
    column = Image.open('code.jpg')
    gray = column.convert('L')    # convert to gray scale vs. RGB or CMYK.
    blackwhite = gray.point(lambda x: 0 if x < 200 else 255, '1')
    blackwhite.save("code_bw.jpg") # TODO: change to use program invocation parameter
     
  • Code for invocation from a shell script, with the image file name passed as the first argument:

    from PIL import Image
    import sys
    column = Image.open(sys.argv[1])
    gray = column.convert('L')
    blackwhite = gray.point(lambda x: 0 if x < 200 else 255, '1')
    blackwhite.save("code_bw.jpg")
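
To make the thresholding step concrete, here is the rule from the `point()` lambda applied to a plain list of gray values, without Pillow. The helper name `binarize` and the sample row are made up for illustration; the cutoff of 200 is the one used in the snippets above.

```python
THRESHOLD = 200  # same cutoff as the lambda passed to gray.point()


def binarize(pixels, threshold=THRESHOLD):
    """Map each 0-255 gray value to 0 (black) or 255 (white)."""
    return [0 if value < threshold else 255 for value in pixels]


row = [12, 199, 200, 231, 90]
print(binarize(row))  # [0, 0, 255, 255, 0]
```

A lower threshold keeps more pixels white, which may suit images with a dark background; the best value depends on the image.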
     

Usage within Selenium

Selenium scripts can make use of Tesseract’s CLI call.
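
As a rough sketch of that CLI route (Python rather than Java, and not taken from any Selenium documentation), a test can save a screenshot to a file and hand it to Tesseract, using `stdout` as the output base so the text comes back on standard output. The names `ocr_text` and `image_contains` are hypothetical, and the `run` parameter exists only so the call can be replaced in tests.

```python
import subprocess


def ocr_text(image_path, lang="eng", run=subprocess.run):
    """Return the text Tesseract reads from image_path.

    Using 'stdout' as the output base makes Tesseract print the text
    instead of writing a .txt file.
    """
    result = run(
        ["tesseract", str(image_path), "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def image_contains(image_path, expected, **kwargs):
    """Case-insensitive check that ignores differences in whitespace."""
    text = " ".join(ocr_text(image_path, **kwargs).split()).lower()
    return " ".join(expected.split()).lower() in text
```

In a Selenium test this might follow a call such as `driver.save_screenshot("page.png")`, with an assertion like `assert image_contains("page.png", "quick brown fox")`.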

Alternatively, Java coders can use Tess4J at https://sourceforge.net/projects/tess4j by adding it as a dependency in the pom.xml file, such as:

    <dependencies>
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>2.0.0</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

Then, in your JUnit test file, add at the top:

import net.sourceforge.tess4j.*;

VIDEO (no sound): Sample code is at https://unmesh.me/2015/06/30/using-tesseract-with-selenium-webdriver-for-checking-text-on-images-using-ocr/

VIDEO: How to set up Tess4j in Eclipse [27:38] per this blog.

Tesseract itself is written in C++, and Tess4J is a Java wrapper around it. Those who code in C# can use the Emgu CV .NET wrapper library (http://www.emgu.com/wiki/index.php/Emgu_CV).

Resources/References

  • https://github.com/gulakov/tesseract-ocr-sample (Visual Studio C++ Project)
  • http://blog.ayoungprogrammer.com/2012/11/tutorial-installing-tesseract-ocr-30202.html/
