I'm using tesseract-ocr-3.01 to scan many forms. The forms all follow a template, so I already know where the regions/rectangles of text are.
Is there a way to pass those regions to tesseract when using the command-line tool?
Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text.
Tesseract attempts to apply automatic page segmentation, but because there is no actual “page” of text, the default --psm 3 fails and returns an empty string. We can resolve this by treating the input image as a single character via --psm 10: running $ tesseract number.png stdout --psm 10 then outputs 2.
Combinations of the first three preprocessing actions are said to boost the accuracy of Tesseract 4.0 from 70.2% to 92.9%.
I found the answer, thanks to this thread.
It seems that tesseract supports the uzn format (used in the UNLV tests).
From the thread:
Calling tesseract with the parameter "-psm 4" and giving the uzn file the same name as the image seems to work.
Example: If we have C:\input.tif and C:\input.uzn, we do this:
tesseract -psm 4 C:\input.tif C:\output
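Since the template regions are already known, the uzn file can be generated by a script. The following is only a minimal sketch, not something taken from the thread: it assumes each line of a uzn file has the form "left top width height label" in pixels, with the origin at the top-left corner of the image, which is my understanding of the UNLV zone-file convention. The helper names write_uzn and tesseract_command, and the example coordinates, are made up for illustration. Note also that newer Tesseract releases spell the option --psm rather than -psm.

```python
from pathlib import Path


def write_uzn(image_path, regions):
    """Write a .uzn file next to the image, one zone per line.

    ASSUMPTION: each line is "left top width height label" in pixels,
    measured from the top-left corner of the image.
    `regions` is an iterable of (left, top, width, height, label) tuples.
    """
    lines = []
    for left, top, width, height, label in regions:
        # The label is assumed to be a single token, so reject whitespace.
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"label must be a single word: {label!r}")
        lines.append(f"{left} {top} {width} {height} {label}")

    # Same base name as the image, with the extension swapped to .uzn.
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text("\n".join(lines) + "\n")
    return uzn_path


def tesseract_command(image_path, output_base, psm=4):
    """Build the command line from the answer as an argument list.

    Uses the Tesseract 3.x spelling "-psm"; the list can be passed to
    subprocess.run() if tesseract is installed.
    """
    return ["tesseract", "-psm", str(psm), str(image_path), str(output_base)]
```

For example, write_uzn("input.tif", [(10, 20, 300, 40, "Name")]) would create input.uzn beside the image, and tesseract_command("input.tif", "output") returns the same command as above in list form.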