
pytesseract using tesseract 4.0 numbers only not working

Has anyone managed to get numbers-only output when calling the latest version of Tesseract (4.0) from Python?

The code below worked in 3.05 but still returns letters in 4.0. I tried removing every config file except the digits file, and it still didn't work. Any help would be great.

im is an image of a date, black text on a white background:

import pytesseract
im = imageOfDate
text = pytesseract.image_to_string(im, config='outputbase digits')
print(text)

CuriousGeorge asked Oct 04 '17 21:10


3 Answers

You can restrict the output to digits by passing tessedit_char_whitelist as a config option, as below.

ocr_result = pytesseract.image_to_string(image, lang='eng',
           config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Hope this helps.
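To make the config string above easier to vary, here is a minimal sketch that builds it from parameters. The helper names build_digit_config and ocr_digits are hypothetical, not part of pytesseract. The default of --psm 7 (treat the image as a single text line) is an assumption based on the question's image being a date; the answer itself uses --psm 10 (single character).

```python
def build_digit_config(psm=7, oem=3, whitelist="0123456789"):
    """Build a Tesseract config string that whitelists the given characters."""
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


def ocr_digits(image, **kwargs):
    """Run OCR on a PIL image or numpy array with the whitelist config.

    Keyword arguments are forwarded to build_digit_config (hypothetical helper).
    """
    # Imported lazily so build_digit_config works without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(
        image, lang="eng", config=build_digit_config(**kwargs)
    )
```

Calling ocr_digits(image, psm=10) reproduces the config from the answer, and whitelist="0123456789/" would also allow a date separator. Whether the whitelist is honored appears to depend on the installed Tesseract version and engine: the LSTM engine in 4.0 is reported to ignore it, so results may differ between builds.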

thewaywewere answered Sep 28 '22 10:09


Using the tessedit_char_whitelist flag with pytesseract did not work for me. However, one workaround is to use a flag that does work, config='digits':

import pytesseract
text = pytesseract.image_to_string(pixels, config='digits')

where pixels is a numpy array of your image (a PIL image should also work). This should force pytesseract to return only digits. To customize what it returns, find your digits configuration file. On Windows, mine was located here:

C:\Program Files (x86)\Tesseract-OCR\tessdata\configs

Open the digits file and add whatever characters you want to allow. After you save the file and rerun pytesseract, it should return only those characters.
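If neither the whitelist nor the digits config restricts the output on a given install, one fallback (not something the answers here propose) is to filter the returned string in Python. This is a minimal sketch; keep_allowed is a hypothetical helper, and the sample string is made up rather than taken from a real OCR run.

```python
def keep_allowed(text, allowed="0123456789"):
    """Return text with every character not in `allowed` removed."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set)


# Hypothetical OCR output where a leading zero was misread as the letter O.
raw = "O4/10/2017\n"
print(keep_allowed(raw, allowed="0123456789/"))  # prints 4/10/2017
```

Note that filtering only drops unwanted characters; it does not correct a misread, so the digit that was read as a letter is lost rather than recovered.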

Robert Harris answered Sep 28 '22 12:09


You can restrict the output to digits by passing tessedit_char_whitelist as a config option, as below.

ocr_result = pytesseract.image_to_string(image, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Tejesh Teju answered Sep 28 '22 11:09