
python-tesseract OCR: get digits only

I'm using tesseract OCR with python-tesseract. The tesseract FAQ says the following about recognizing digits only:

Use

TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");

BEFORE calling an Init function or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789

and then your command line becomes:

tesseract image.tif outputbase nobatch digits

Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.

In python-tesseract, the SetVariable method exists. I tried the following, but the OCR result is the same as without the whitelist:

import tesseract

api = tesseract.TessBaseAPI()
api.SetVariable("tessedit_char_whitelist", "0123456789")
api.Init('.','eng',tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

Has anyone gotten this working, or should I consider it a bug in python-tesseract?

jpimentel asked Mar 20 '12 20:03


1 Answer

OK, got it working. According to this (unofficial?) documentation of tesseract-ocr, SetVariable() must be called after Init(), even though the official FAQ says the opposite. Calling it after Init() works as intended.
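
For reference, here is a minimal sketch of the reordered calls. It uses only the calls from the question; the helper name make_digits_api and its parameters are made up for illustration, and the python-tesseract module is passed in as an argument so the call order is easy to see.

```python
def make_digits_api(tesseract, datapath=".", lang="eng"):
    """Return a TessBaseAPI restricted to digits.

    `tesseract` is the imported python-tesseract module. The helper name and
    the `datapath`/`lang` parameters are illustrative, not part of the library.
    """
    api = tesseract.TessBaseAPI()
    # Init() comes first: per the answer above, a whitelist set before
    # Init() had no effect on the OCR result.
    api.Init(datapath, lang, tesseract.OEM_DEFAULT)
    # Only after Init() restrict recognition to the digits 0-9.
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    api.SetPageSegMode(tesseract.PSM_AUTO)
    return api
```

Typical use would be "import tesseract" followed by "api = make_digits_api(tesseract)", assuming the eng language data is reachable from the given datapath as in the question.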

jpimentel answered Sep 29 '22 22:09