
Limit characters tesseract is looking for

Tags: ocr, tesseract

Is it possible to limit the set of characters that tesseract is looking for (e.g. search only for letters a-z)? That would improve my results greatly.

Danilo Bargen asked Mar 02 '10 13:03




1 Answer

Create a config file (e.g. "letters") in the tessdata/configs directory, usually
/usr/share/tesseract/tessdata/configs
or
/usr/share/tesseract-ocr/tessdata/configs

And add this line to the config file:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz 
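
If you would rather generate the config file than type it by hand, here is a minimal Python sketch. The helper name write_whitelist_config and the temporary demo directory are my own assumptions, not part of tesseract. It spells out the alphabet explicitly with string.ascii_lowercase instead of relying on any range syntax.

```python
import string
import tempfile
from pathlib import Path


def write_whitelist_config(configs_dir, name="letters", chars=string.ascii_lowercase):
    """Write a config file that whitelists `chars` and return its path.

    Hypothetical helper for illustration; it only writes a text file.
    """
    path = Path(configs_dir) / name
    # Same "variable value" format as the config line shown above.
    path.write_text(f"tessedit_char_whitelist {chars}\n", encoding="ascii")
    return path


# Demo in a temporary directory so the sketch runs without root access.
# To install the file for real, pass your tessdata/configs directory instead.
demo_dir = tempfile.mkdtemp()
config_path = write_whitelist_config(demo_dir)
print(config_path.read_text(encoding="ascii"), end="")
```

Writing into the system-wide tessdata/configs directory usually requires root privileges, which is why the demo uses a temporary directory.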

A range such as [a-z] might work in place of the full alphabet, but I have not tested that. Then call tesseract similar to this:

tesseract input.tif output nobatch letters   

That limits tesseract to recognizing only the whitelisted characters.
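
When calling tesseract from a script, the same invocation can be assembled as an argument list. The sketch below is an illustration under assumptions: the function names are hypothetical, and the second variant relies on the -c option for setting a variable on the command line, which newer tesseract releases accept (check tesseract --help for your version). It only prints the commands, so it runs even where tesseract is not installed.

```python
import shlex
import string


def config_file_command(image, output_base, config_name="letters"):
    """Command that applies a whitelist stored in a tessdata/configs file."""
    return ["tesseract", image, output_base, "nobatch", config_name]


def inline_command(image, output_base, chars=string.ascii_lowercase):
    """Command that passes the whitelist inline with -c (no config file)."""
    return ["tesseract", image, output_base, "-c", f"tessedit_char_whitelist={chars}"]


# Print the commands rather than executing them.
print(shlex.join(config_file_command("input.tif", "output")))
print(shlex.join(inline_command("input.tif", "output")))
```

To actually run one of them, pass the list to subprocess.run(cmd, check=True) on a machine where tesseract is on the PATH.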

Blomman answered Sep 26 '22 17:09