
Explicitly set the font to be used for recognition by Tesseract-OCR

I have documents which use only one font throughout the document. Different documents might have different fonts, but I know which document uses which font.

Is there an option to explicitly tell Tesseract-OCR which font to use during recognition for a given image?

sashoalm asked Oct 31 '12 08:10



People also ask

Why is the Tesseract OCR not accurate?

Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text. When that happens, you need to create rules and heuristics that can be used to improve the output OCR quality.

What is the best font for scanning?

The OCR engine is capable of recognizing text in many different fonts. However, standard fonts, such as Arial and Times New Roman, give better recognition results than fonts with more unusual character shapes.

What is OEM and PSM in Tesseract?

The --oem argument, or OCR Engine Mode, controls the type of algorithm used by Tesseract. The --psm argument, or Page Segmentation Mode, controls how Tesseract splits the page into blocks of text before recognition.
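As a rough illustration of how these two flags are passed, here is a minimal Python sketch that only builds the command line and does not run it. It assumes Tesseract 4 or later, where --oem accepts 0 to 3 and --psm accepts 0 to 13. The helper name ocr_mode_flags and the file names page.png and out are hypothetical.

```python
from __future__ import annotations

import shlex


def ocr_mode_flags(oem: int, psm: int) -> list[str]:
    """Return the --oem/--psm arguments after a basic range check.

    Assumed ranges (Tesseract 4+): oem 0..3, psm 0..13.
    """
    if oem not in range(0, 4):
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in range(0, 14):
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return ["--oem", str(oem), "--psm", str(psm)]


# Hypothetical input image and output base name.
# oem=1 selects the LSTM engine; psm=6 assumes a single uniform block of text.
cmd = ["tesseract", "page.png", "out", *ocr_mode_flags(oem=1, psm=6)]
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a machine where the tesseract binary is installed.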


1 Answer

No, I don't think Tesseract supports such an option. What you can do is train Tesseract on one specific font and then select the resulting traineddata file (with the -l option) when recognizing documents that use that font.
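Since the asker already knows which font each document uses, selecting a per-font model could look roughly like the sketch below. It only builds the command lines and assumes Tesseract 4 or later (for --tessdata-dir). The model names myfont_a and myfont_b, the ./tessdata directory, and the file names are all hypothetical; each model name stands for a file such as myfont_a.traineddata that you have trained yourself.

```python
from __future__ import annotations

import shlex


def build_font_command(
    image_path: str,
    output_base: str,
    font_model: str,
    tessdata_dir: str | None = None,
) -> list[str]:
    """Build a tesseract command that uses one specific traineddata model.

    font_model is the traineddata name without the extension, passed via -l.
    """
    cmd = ["tesseract", image_path, output_base, "-l", font_model]
    if tessdata_dir is not None:
        # Directory containing <font_model>.traineddata (assumed location).
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Hypothetical mapping from document image to the model trained on its font.
FONT_MODELS = {
    "invoice.png": "myfont_a",
    "letter.png": "myfont_b",
}

for image, model in FONT_MODELS.items():
    output_base = image.rsplit(".", 1)[0]
    print(shlex.join(build_font_command(image, output_base, model, "./tessdata")))
```

Each printed command can be run directly in a shell, or the list can be passed to subprocess.run, once the corresponding traineddata files exist.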

nguyenq answered Sep 21 '22 20:09
