
Tesseract: Specifying regions of text

Tags:

ocr

tesseract

I'm using tesseract-ocr-3.01 to scan many forms. The forms all follow a template, so I already know where the regions/rectangles of text are.

Is there a way to pass those regions to tesseract when using the command-line tool?

sashoalm asked Oct 19 '12 09:10




1 Answer

I found the answer, thanks to this thread.

It seems that tesseract supports the uzn format (used in the UNLV tests).

From the thread:

Calling tesseract with the parameter "-psm 4" and giving the uzn file the same name as the image seems to work.

Example: If we have C:\input.tif and C:\input.uzn, we do this:

tesseract -psm 4 C:\input.tif C:\output
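
Since the forms follow a template, the uzn file can be generated rather than written by hand. Below is a minimal sketch in Python. It assumes each line of a uzn file has the form "left top width height label", with pixel coordinates measured from the top-left corner of the image and a label containing no spaces. That layout is my understanding of the format and is not stated in the thread, so check it against your tesseract version. The helper names (format_uzn, uzn_path_for, tesseract_command) and the zone values are hypothetical.

```python
from pathlib import Path


def format_uzn(zones):
    """Build uzn text from (left, top, width, height, label) tuples.

    Assumed layout: one zone per line, fields separated by spaces.
    """
    lines = []
    for left, top, width, height, label in zones:
        if width <= 0 or height <= 0:
            raise ValueError(f"zone {label!r} must have positive width and height")
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"zone label {label!r} must be a single token")
        lines.append(f"{left} {top} {width} {height} {label}")
    return "\n".join(lines) + "\n"


def uzn_path_for(image_path):
    """The uzn file sits next to the image and shares its base name."""
    return Path(image_path).with_suffix(".uzn")


def tesseract_command(image_path, output_base):
    """Same invocation as in the answer: -psm 4, image, output base."""
    return ["tesseract", "-psm", "4", str(image_path), str(output_base)]


if __name__ == "__main__":
    # Hypothetical template regions for one form.
    template_zones = [
        (100, 50, 400, 60, "Name"),
        (100, 130, 400, 60, "Date"),
    ]
    print(uzn_path_for("input.tif"))
    print(format_uzn(template_zones), end="")
    print(" ".join(tesseract_command("input.tif", "output")))
```

To use it, write the text returned by format_uzn to the path given by uzn_path_for for each scanned image, then run the command (for example with subprocess.run). The same zone list can be reused for every form that follows the template.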
sashoalm answered Nov 06 '22 04:11