Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to make tesseract to recognize only numbers, when they are mixed with letters?

Tags:

ocr

tesseract

I want to use tesseract to recognize only numbers. The problem is that I have mixture of numbers & letters and when I use SetVariable("tessedit_char_whitelist", "0123456789")
for every symbol tesseract returns wrong digit.

Can I set a threshold value so that tesseract omits the symbols with low resemblance?

NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.

like image 949
zkunov Avatar asked Feb 09 '11 12:02

zkunov


People also ask

Can Tesseract detect numbers?

Python Tesseract 4.0 OCR: Recognize only Numbers / Digits and exclude all other Characters. Googles Tesseract (originally from HP) is one of the most popular, free Optical Character Recognition (OCR) software out there. It can be used with several programming languages because many wrappers exist for this project.

How does Tesseract recognize text from images?

Optical Character Recognition (OCR) is a technology that is used to recognize text from images. It can be used to convert tight handwritten or printed texts into machine-readable texts. To use OCR, you need to install and configure tesseract on your computer. First, download the Tesseract OCR executables here.

Why is the Tesseract OCR not accurate?

Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text. When that happens, you need to create rules and heuristics that can be used to improve the output OCR quality.

Can Tesseract detect language?

Unfortunately tesseract does not have a feature to detect language of the text in an image automatically. An alternative solution is provided by another python module called langdetect which can be installed via pip.


1 Answers

Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the commandline:

tesseract image.tif outputbase nobatch digits 

As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.

like image 104
Jerry Avatar answered Oct 07 '22 17:10

Jerry