Tesseract (OCR)

Recognizes text and special characters in image files (after preparation with ImageMagick), for 100+ languages (using LSTM machine learning). Used by Selenium.

Overview

Usage
Installation
Image Preparation
Usage within Python
Usage within Selenium
Resources/References
More

The word “Tesseract” was adopted as the name of the OCR (Optical Character Recognition) engine program because it is able to recognize multiple-directional 3D lines.

NOTE: Content here is my personal opinion, and is not intended to represent any employer (past or present). “PROTIP:” here highlights information I haven’t seen elsewhere on the internet because it consists of hard-won, little-known but significant facts based on my personal research and experience.

The Tesseract shown in the Marvel Cinematic Universe is a (3 dimensional) physical cube. But the object has a 4th dimension of time, thus enabling time travel in the MCU and in Madeleine L’Engle’s novel/movie “A Wrinkle in Time”.

VIDEO: But a tesseract in real-life geometry is a four-dimensional hypercube: a conceptual fourth “w” axis is added to x, y, and z, and the shape is usually shown as its 3D shadow.

Usage

I wrote a shell script that converts the last file created in folder ~/Desktop and opens the output file in VSCode (using the code command):
```
./ocr.sh
```
Optionally, specify a file name:
```
./ocr.sh "Screen Shot 2020-05-10 at 3.18.06 PM.png"
```
Afterward, the image file is deleted.
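
The script itself is not shown here. As a rough Python sketch of the same idea (not the author's ocr.sh), the newest image in a folder can be picked by modification time and handed to the tesseract CLI. The names `newest_file` and `ocr_command` are hypothetical, and using modification time as a stand-in for "last file created" is an assumption.

```python
from pathlib import Path
from typing import List, Optional


def newest_file(folder: Path, suffixes=(".png", ".jpg", ".tiff")) -> Optional[Path]:
    """Return the most recently modified image file in folder, or None."""
    candidates = [p for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() in suffixes]
    # Modification time is used as an approximation of creation time.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def ocr_command(image: Path, output_base: str = "out") -> List[str]:
    """Build the CLI call: tesseract IMAGE OUTPUTBASE (writes OUTPUTBASE.txt)."""
    return ["tesseract", str(image), output_base]
```

Passing the resulting list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with file names that contain spaces, such as "Screen Shot 2020-05-10 at 3.18.06 PM.png".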

Installation

Tesseract 4 is included with Ubuntu 18.04+.

For other operating systems, install a pre-built executable binary from https://github.com/tesseract-ocr/tesseract/wiki.

On macOS:
```
brew install tesseract --HEAD
pip install pytesseract
```

Verify the version:

tesseract -v

tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

The Leptonica dependency (http://www.leptonica.org) provides utilities for image processing and image analysis.

Use Tesseract


IN_FILE="tesseract-quick-brown-fox.png"
tesseract "${IN_FILE}"  out

PROTIP: Navigate to the folder where image files are captured to, usually:
```
cd ~/Desktop
```
Sample file

Download the sample image file (above), which we will turn into text:

wget https://raw.githubusercontent.com/wilsonmar/DevSecOps/master/Tesseract/tesseract-quick-brown-fox.png

Run Tesseract from that folder (the sample .png can also be .tiff, .jpg, .gif, .bmp, etc.)

IN_FILE="tesseract-quick-brown-fox.png"
tesseract "${IN_FILE}"  out

Response:

Tesseract Open Source OCR Engine v4.1.0 with Leptonica

By default, Tesseract writes plain-text output, uses the English language, and uses Page Segmentation Mode 3. The available parameters are listed by this command:

tesseract --help-extra

Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]
 
OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  --dpi VALUE           Specify DPI for input image.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                     Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.
 
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
    bypassing hacks that are Tesseract-specific.
 
OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
 
Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.
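
To make the option order in the help text concrete, here is a minimal Python sketch that assembles such a command line. The function name `tesseract_args` and its defaults are assumptions for illustration; only the flags themselves (`-l`, `--psm`, `--oem`, and a trailing configfile) come from the help output above.

```python
from typing import List, Optional, Sequence


def tesseract_args(image: str, output_base: str = "out",
                   langs: Sequence[str] = ("eng",),
                   psm: int = 3, oem: int = 3,
                   configfile: Optional[str] = None) -> List[str]:
    """Assemble a tesseract call; options must precede any configfile."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    args = ["tesseract", image, output_base,
            "-l", "+".join(langs),      # e.g. eng+deu
            "--psm", str(psm), "--oem", str(oem)]
    if configfile:                      # e.g. "pdf" to write out.pdf
        args.append(configfile)
    return args
```

For example, `tesseract_args("page.png", langs=("eng", "deu"), psm=6)` yields a list that can be passed directly to `subprocess.run`.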

Use a text editor to view the contents of output file out.txt created by Tesseract based on the sample image file:
```
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,.schnelle” braune Fuchs springt
iiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marron rapido salta sobre el perro
perezoso. A raposa marrom rapida
salta sobre o céo preguicoso.
```
Even though the image is slightly tilted, Tesseract should recognize all the various special characters such as curly braces, angle brackets, !, $, #, %, slash, and @ signs. However:
- the opening low quotation mark („) in front of “schnelle” is misrecognized as “,.”,
- the tilde on top of “cão” is wrongly recognized as an acute accent (“céo”),
- the cedilla under the “ç” in “preguiçoso” is not recognized.

Language recognition

It did not recognize European accents such as the umlaut in “über” (output as “iiber”). “marron rapido” is supposed to be accented (“marrón rápido”). “preguicoso”, a Portuguese word meaning lazy, is missing the cedilla, the diacritical tail under the c (cedilha in Portuguese).

But Tesseract is supposed to recognize characters from over 100 languages now. Tesseract originally came from HP. @theRaySmith at Google said in 2016 that Tesseract includes an LSTM (Long Short-Term Memory) machine-learning algorithm with deep belief networks.

NOTE: LSTM is a form of RNN (Recurrent Neural Network) algorithm that recognizes a sequence of characters rather than single characters (which are better handled by a CNN, or Convolutional Neural Network).

To get Tesseract to recognize the full set of language characters, run with additional parameters specifying more language codes from the wiki site:

tesseract  tesseract-quick-brown-fox.png  out  -l eng+deu+fra+ita+spa+por

The sequence of language codes matters: deu = Deutsch (German), fra = French, ita = Italian, spa = Spanish, por = Portuguese.

Error messages are expected if additional configuration was not done:

Error opening data file /usr/local/Cellar/tesseract/4.1.0/share/tessdata/deu.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'deu'
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
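
The error above appears when a requested language file is absent. A small pre-flight check can report the gaps before Tesseract is called. This is only a sketch, assuming (as the error message suggests) that each code maps to a file named CODE.traineddata in the tessdata folder; the function name `missing_languages` is hypothetical.

```python
from pathlib import Path
from typing import List


def missing_languages(lang_spec: str, tessdata_dir: Path) -> List[str]:
    """Return the codes in a spec such as 'eng+deu' that lack CODE.traineddata."""
    codes = [code for code in lang_spec.split("+") if code]
    return [code for code in codes
            if not (tessdata_dir / f"{code}.traineddata").is_file()]
```

An empty list means every language in the `-l` argument has a data file in that folder.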

List the default languages available:
```
tesseract --list-langs
```
The codes in the response are explained on the wiki site; for example, “osd” = Orientation and Script Detection:
```
List of available languages (3):
eng
osd
snum
```
So we install language files:
```
brew install tesseract-lang
```
It’s a large file installed from https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
Identify location of language files:
```
brew list tesseract-lang
```
Expected response:
```
/usr/local/Cellar/tesseract-lang/4.0.0/share/tessdata/ (161 files)
```
TODO: The version is a bit behind? The “4.0.0” in the path means this step needs to be manually redone when a newer version is available.
Set the TESSDATA_PREFIX environment variable to the folder where Tesseract looks for language files:

This path overrides the default of “/usr/local/share/tesseract-ocr/”.
```
export TESSDATA_PREFIX="/usr/local/Cellar/tesseract/4.1.0/share/tessdata/"
ls $TESSDATA_PREFIX
```

The tesseract-lang folder created does not contain the default languages, so merge its files into the Tesseract folder that does:

cp -a /usr/local/Cellar/tesseract-lang/4.0.0/share/tessdata/. /usr/local/Cellar/tesseract/4.1.0/share/tessdata/
ls -al "$TESSDATA_PREFIX"
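
The same merge can be approximated in Python. This sketch assumes Python 3.8+ (for `dirs_exist_ok`) and uses a hypothetical function name, `merge_tessdata`; it copies file contents and metadata but is not an exact equivalent of `cp -a` (for example, symlinks are followed rather than preserved).

```python
import shutil
from pathlib import Path


def merge_tessdata(src: Path, dst: Path) -> int:
    """Copy everything under src into dst, keeping files already in dst.

    Returns the number of files in dst after the merge.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+
    return sum(1 for p in dst.rglob("*") if p.is_file())
```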

To reclaim disk space:
```
brew remove tesseract-lang
```
```
Uninstalling /usr/local/Cellar/tesseract-lang/4.0.0... (163 files, 651.8MB)
```
Tesserocr for Python

To see the much longer list of languages now available, list them again, this time from a Python program that uses the tesserocr wrapper for Python.

Install using pip:

pip install tesserocr
pip install Pillow

The response at time of writing:

Collecting tesserocr
  Downloading https://files.pythonhosted.org/packages/e3/77/fb26b321c3b9ce4a47af12b19e85ddbf4d0629adb6552d85276e824e6e51/tesserocr-2.5.0.tar.gz (54kB)
  |████████████████████████████████| 61kB 156kB/s 
Building wheels for collected packages: tesserocr
  Building wheel for tesserocr (setup.py) ... done
  Created wheel for tesserocr: filename=tesserocr-2.5.0-cp37-cp37m-macosx_10_14_x86_64.whl size=169578 sha256=0ea0c430f6649a974c43805e9d8662fe0f14ef85f305e3fba6ded924cf4eb1a5
  Stored in directory: /Users/wilson_mar/Library/Caches/pip/wheels/c0/32/13/70d610c079b65b21a5fb84af4fe7593cbf06da35f69cf10209
Successfully built tesserocr
Installing collected packages: tesserocr
Successfully installed tesserocr-2.5.0

Run the Python program:
```
python3 GetAvailableLanguages.py
```
The response is a Python list printed to the console:

[‘afr’, ‘amh’, ‘ara’, ‘asm’, ‘aze’, ‘aze_cyrl’, ‘bel’, ‘ben’, ‘bod’, ‘bos’, ‘bre’, ‘bul’, ‘cat’, ‘ceb’, ‘ces’, ‘chi_sim’, ‘chi_sim_vert’, ‘chi_tra’, ‘chi_tra_vert’, ‘chr’, ‘cos’, ‘cym’, ‘dan’, ‘deu’, ‘div’, ‘dzo’, ‘ell’, ‘eng’, ‘enm’, ‘epo’, ‘est’, ‘eus’, ‘fao’, ‘fas’, ‘fil’, ‘fin’, ‘fra’, ‘frk’, ‘frm’, ‘fry’, ‘gla’, ‘gle’, ‘glg’, ‘grc’, ‘guj’, ‘hat’, ‘heb’, ‘hin’, ‘hrv’, ‘hun’, ‘hye’, ‘iku’, ‘ind’, ‘isl’, ‘ita’, ‘ita_old’, ‘jav’, ‘jpn’, ‘jpn_vert’, ‘kan’, ‘kat’, ‘kat_old’, ‘kaz’, ‘khm’, ‘kir’, ‘kmr’, ‘kor’, ‘kor_vert’, ‘lao’, ‘lat’, ‘lav’, ‘lit’, ‘ltz’, ‘mal’, ‘mar’, ‘mkd’, ‘mlt’, ‘mon’, ‘mri’, ‘msa’, ‘mya’, ‘nep’, ‘nld’, ‘nor’, ‘oci’, ‘ori’, ‘osd’, ‘pan’, ‘pol’, ‘por’, ‘pus’, ‘que’, ‘ron’, ‘rus’, ‘san’, ‘script/Arabic’, ‘script/Armenian’, ‘script/Bengali’, ‘script/Canadian_Aboriginal’, ‘script/Cherokee’, ‘script/Cyrillic’, ‘script/Devanagari’, ‘script/Ethiopic’, ‘script/Fraktur’, ‘script/Georgian’, ‘script/Greek’, ‘script/Gujarati’, ‘script/Gurmukhi’, ‘script/HanS’, ‘script/HanS_vert’, ‘script/HanT’, ‘script/HanT_vert’, ‘script/Hangul’, ‘script/Hangul_vert’, ‘script/Hebrew’, ‘script/Japanese’, ‘script/Japanese_vert’, ‘script/Kannada’, ‘script/Khmer’, ‘script/Lao’, ‘script/Latin’, ‘script/Malayalam’, ‘script/Myanmar’, ‘script/Oriya’, ‘script/Sinhala’, ‘script/Syriac’, ‘script/Tamil’, ‘script/Telugu’, ‘script/Thaana’, ‘script/Thai’, ‘script/Tibetan’, ‘script/Vietnamese’, ‘sin’, ‘slk’, ‘slv’, ‘snd’, ‘snum’, ‘spa’, ‘spa_old’, ‘sqi’, ‘srp’, ‘srp_latn’, ‘sun’, ‘swa’, ‘swe’, ‘syr’, ‘tam’, ‘tat’, ‘tel’, ‘tessconfigs/afr’, ‘tessconfigs/amh’, ‘tessconfigs/ara’, ‘tessconfigs/asm’, ‘tessconfigs/aze’, ‘tessconfigs/aze_cyrl’, ‘tessconfigs/bel’, ‘tessconfigs/ben’, ‘tessconfigs/bod’, ‘tessconfigs/bos’, ‘tessconfigs/bre’, ‘tessconfigs/bul’, ‘tessconfigs/cat’, ‘tessconfigs/ceb’, ‘tessconfigs/ces’, ‘tessconfigs/chi_sim’, ‘tessconfigs/chi_sim_vert’, ‘tessconfigs/chi_tra’, ‘tessconfigs/chi_tra_vert’, ‘tessconfigs/chr’, ‘tessconfigs/cos’, ‘tessconfigs/cym’, ‘tessconfigs/dan’, 
‘tessconfigs/deu’, ‘tessconfigs/div’, ‘tessconfigs/dzo’, ‘tessconfigs/ell’, ‘tessconfigs/eng’, ‘tessconfigs/enm’, ‘tessconfigs/epo’, ‘tessconfigs/est’, ‘tessconfigs/eus’, ‘tessconfigs/fao’, ‘tessconfigs/fas’, ‘tessconfigs/fil’, ‘tessconfigs/fin’, ‘tessconfigs/fra’, ‘tessconfigs/frk’, ‘tessconfigs/frm’, ‘tessconfigs/fry’, ‘tessconfigs/gla’, ‘tessconfigs/gle’, ‘tessconfigs/glg’, ‘tessconfigs/grc’, ‘tessconfigs/guj’, ‘tessconfigs/hat’, ‘tessconfigs/heb’, ‘tessconfigs/hin’, ‘tessconfigs/hrv’, ‘tessconfigs/hun’, ‘tessconfigs/hye’, ‘tessconfigs/iku’, ‘tessconfigs/ind’, ‘tessconfigs/isl’, ‘tessconfigs/ita’, ‘tessconfigs/ita_old’, ‘tessconfigs/jav’, ‘tessconfigs/jpn’, ‘tessconfigs/jpn_vert’, ‘tessconfigs/kan’, ‘tessconfigs/kat’, ‘tessconfigs/kat_old’, ‘tessconfigs/kaz’, ‘tessconfigs/khm’, ‘tessconfigs/kir’, ‘tessconfigs/kmr’, ‘tessconfigs/kor’, ‘tessconfigs/kor_vert’, ‘tessconfigs/lao’, ‘tessconfigs/lat’, ‘tessconfigs/lav’, ‘tessconfigs/lit’, ‘tessconfigs/ltz’, ‘tessconfigs/mal’, ‘tessconfigs/mar’, ‘tessconfigs/mkd’, ‘tessconfigs/mlt’, ‘tessconfigs/mon’, ‘tessconfigs/mri’, ‘tessconfigs/msa’, ‘tessconfigs/mya’, ‘tessconfigs/nep’, ‘tessconfigs/nld’, ‘tessconfigs/nor’, ‘tessconfigs/oci’, ‘tessconfigs/ori’, ‘tessconfigs/osd’, ‘tessconfigs/pan’, ‘tessconfigs/pol’, ‘tessconfigs/por’, ‘tessconfigs/pus’, ‘tessconfigs/que’, ‘tessconfigs/ron’, ‘tessconfigs/rus’, ‘tessconfigs/san’, ‘tessconfigs/sin’, ‘tessconfigs/slk’, ‘tessconfigs/slv’, ‘tessconfigs/snd’, ‘tessconfigs/snum’, ‘tessconfigs/spa’, ‘tessconfigs/spa_old’, ‘tessconfigs/sqi’, ‘tessconfigs/srp’, ‘tessconfigs/srp_latn’, ‘tessconfigs/sun’, ‘tessconfigs/swa’, ‘tessconfigs/swe’, ‘tessconfigs/syr’, ‘tessconfigs/tam’, ‘tessconfigs/tat’, ‘tessconfigs/tel’, ‘tessconfigs/tgk’, ‘tessconfigs/tha’, ‘tessconfigs/tir’, ‘tessconfigs/ton’, ‘tessconfigs/tur’, ‘tessconfigs/uig’, ‘tessconfigs/ukr’, ‘tessconfigs/urd’, ‘tessconfigs/uzb’, ‘tessconfigs/uzb_cyrl’, ‘tessconfigs/vie’, ‘tessconfigs/yid’, ‘tessconfigs/yor’, ‘tgk’, ‘tha’, ‘tir’, ‘ton’, 
‘tur’, ‘uig’, ‘ukr’, ‘urd’, ‘uzb’, ‘uzb_cyrl’, ‘vie’, ‘yid’, ‘yor’]
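
GetAvailableLanguages.py is not reproduced here. The tesserocr package offers a `get_languages()` helper for this purpose; the sketch below instead uses only the standard library, under the assumption that each entry corresponds to a .traineddata file somewhere under the tessdata folder. The function name `available_languages` is hypothetical, and its output is not claimed to match the list above entry for entry.

```python
from pathlib import Path
from typing import List


def available_languages(tessdata_dir: Path) -> List[str]:
    """List a code for every *.traineddata file, including subfolders such as script/."""
    codes = []
    for data_file in tessdata_dir.rglob("*.traineddata"):
        # Drop the folder prefix and the .traineddata suffix, e.g. script/Latin
        relative = data_file.relative_to(tessdata_dir).with_suffix("")
        codes.append(relative.as_posix())
    return sorted(codes)
```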

Run to see accents

Run again, then edit the out.txt file again. You should now see accent characters:

Der „schnelle” braune Fuchs springt
über den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. EI zorro
marrön räpido salta sobre el perro
perezoso. A raposa marrom räpida
salta sobre o cão preguiçoso.

To output to a pdf file instead of txt file, add “pdf” to the end of the command.

Asian characters

When specifying the language code for Chinese, note there are chi_sim (Simplified) and chi_tra (Traditional) for horizontal left-to-right text, and chi_sim_vert and chi_tra_vert for vertical text.

Japanese and Korean language codes behave the same way (jpn and jpn_vert, kor and kor_vert).

Image Preparation

If you need to convert images, use the popular open-source ImageMagick (https://imagemagick.org).

Install using Homebrew (instead of downloading, unzipping, setting variables, etc.):
```
brew install imagemagick
```
Because ImageMagick depends on Ghostscript fonts, install them as well:
```
brew install ghostscript
```
To convert a file (such as a PDF) into a high-resolution image, use ImageMagick’s convert command:
```
convert -density 300 test.pdf -depth 8 -strip -background white -alpha off out.tiff
```
This renders the input at a density of 300, removes the alpha channel, and writes a TIFF format file. The last parameter is the output file.

An alternative parameter is “-monochrome”, to convert to black-and-white.

Resources for this section include:

Using Tesseract-OCR to extract text from images

Usage within Python

Within venv

https://github.com/sirfz/tesserocr

Install Pillow, a module for image processing in Python:

pip install Pillow

Code for Python:

https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d offers this snippet:

from PIL import Image  # PIL = import name kept by Pillow, the fork of the old PIL library
column = Image.open('code.jpg')
gray = column.convert('L')    # convert to gray scale vs. RGB or CMYK.
blackwhite = gray.point(lambda x: 0 if x < 200 else 255, '1')
blackwhite.save("code_bw.jpg") # TODO: change to use program invocation parameter

Code for calling from a shell script (the image file name is passed as the first argument):

from PIL import Image
import sys
column = Image.open(sys.argv[1])
gray = column.convert('L')
blackwhite = gray.point(lambda x: 0 if x < 200 else 255, '1')
blackwhite.save("code_bw.jpg")
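
To show what the `point(lambda x: 0 if x < 200 else 255, '1')` step does to each pixel, here is the same threshold applied to a plain list of gray values, with no Pillow needed. The function name `binarize` is hypothetical; the threshold of 200 is taken from the snippets above.

```python
from typing import Iterable, List


def binarize(pixels: Iterable[int], threshold: int = 200) -> List[int]:
    """Map each 0..255 gray value to 0 (black) if below threshold, else 255 (white)."""
    return [0 if value < threshold else 255 for value in pixels]


print(binarize([12, 199, 200, 255]))  # [0, 0, 255, 255]
```

A lower threshold keeps more light-gray pixels white, which may help with faint backgrounds; the best value depends on the image.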

Usage within Selenium

Selenium scripts can make use of Tesseract’s CLI call.
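
As a sketch of such a CLI call from a Python-based Selenium script: the function below runs tesseract on a saved screenshot and returns the text, using `stdout` as the output base (listed in the help text earlier). The name `ocr_screenshot` and the injectable `run` parameter are assumptions for illustration, added so the function can be exercised without Tesseract installed.

```python
import subprocess
from typing import Callable


def ocr_screenshot(png_path: str, lang: str = "eng",
                   run: Callable = subprocess.run) -> str:
    """Run `tesseract PNG stdout -l LANG` and return the recognized text."""
    result = run(["tesseract", png_path, "stdout", "-l", lang],
                 capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

In a test, this could follow `driver.save_screenshot("page.png")`, with an assertion that the expected words appear in `ocr_screenshot("page.png")`.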

Alternatively, Java coders can use Tess4j (https://sourceforge.net/projects/tess4j) by adding it as a dependency in the pom.xml file, such as:

    <dependencies>
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>2.0.0</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

Then, in your JUnit test file, add at the top:

import net.sourceforge.tess4j.*;

VIDEO (no sound): Sample code is at https://unmesh.me/2015/06/30/using-tesseract-with-selenium-webdriver-for-checking-text-on-images-using-ocr/

VIDEO: How to set up Tess4j in Eclipse [27:38] per this blog.

Tess4j is a Java wrapper; Tesseract itself is written in C++. Those who code C# can use the Emgu CV .NET wrapper library (http://www.emgu.com/wiki/index.php/Emgu_CV).

Resources/References

https://github.com/gulakov/tesseract-ocr-sample (Visual Studio C++ Project)
http://blog.ayoungprogrammer.com/2012/11/tutorial-installing-tesseract-ocr-30202.html/


Wilson Mar
