I'm using Tesseract OCR with python-tesseract. In the Tesseract FAQ, regarding digits, we have:
Use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
BEFORE calling an Init function or put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.
In python-tesseract, the SetVariable method exists. I've tried this, but the result of the OCR is the same:
import tesseract

api = tesseract.TessBaseAPI()
# Whitelist set BEFORE Init, as the FAQ suggests
api.SetVariable("tessedit_char_whitelist", "0123456789")
api.Init('.','eng',tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)
Has anyone already got this working, or should I consider it a bug in python-tesseract?
OK, got it working. According to this (unofficial?) documentation of tesseract-ocr, SetVariable() must be called after Init(), even though the official FAQ says the opposite. Calling it after Init() works as intended.
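For reference, here is a minimal sketch of the working call order, using the same calls as in the question. The helper name make_digits_api and its parameters are my own, purely for illustration. It takes the imported python-tesseract module as an argument, and the data path '.' and language 'eng' are just the values from the question.

```python
def make_digits_api(tess, datapath=".", lang="eng"):
    """Build a TessBaseAPI restricted to digits.

    `tess` is the imported python-tesseract module (import tesseract).
    This helper is hypothetical; only the calls inside it come from
    the question above.
    """
    api = tess.TessBaseAPI()
    # Init first ...
    api.Init(datapath, lang, tess.OEM_DEFAULT)
    # ... then set the whitelist, so that it actually takes effect.
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    api.SetPageSegMode(tess.PSM_AUTO)
    return api
```

With python-tesseract installed, this would be used as: import tesseract, then api = make_digits_api(tesseract). The only change from the code in the question is that SetVariable() now comes after Init().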