I have documents which use only one font throughout the document. Different documents might have different fonts, but I know which document uses which font.
Is there an option to explicitly tell Tesseract-OCR which font to use during recognition for a given image?
Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text. When that happens, you need to create rules and heuristics that can be used to improve the output OCR quality.
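As a rough illustration of what such a rule could look like, here is a minimal sketch that repairs letters commonly misread in place of digits. The confusion table and the "at least half digits" threshold are assumptions chosen for the example, not something Tesseract provides, so tune them against your own documents.

```python
import re

# Hypothetical confusion table: letters that OCR often outputs instead of digits.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def fix_numeric_tokens(text: str) -> str:
    """Apply digit substitutions only to tokens that already look numeric."""

    def repair(match: re.Match) -> str:
        token = match.group(0)
        digit_count = sum(ch.isdigit() for ch in token)
        # Assumed heuristic: only touch tokens where at least half the characters are digits.
        if digit_count * 2 >= len(token):
            return token.translate(DIGIT_FIXES)
        return token

    # Process each alphanumeric token independently, leaving spacing and punctuation alone.
    return re.sub(r"[A-Za-z0-9]+", repair, text)


if __name__ == "__main__":
    print(fix_numeric_tokens("Invoice 2O19 total l00"))
```

Ordinary words such as "Invoice" contain no digits and are left alone, while tokens like "2O19" are treated as numbers with a misread character.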
The OCR engine is capable of recognizing text in many different fonts. However, standard fonts, such as Arial and Times New Roman, give better recognition results than fonts with more unusual character shapes.
The --oem argument, or OCR Engine Mode, controls the type of algorithm used by Tesseract. The --psm argument, or Page Segmentation Mode, controls how Tesseract segments the page into blocks of text.
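To show how these two arguments are passed, here is a small sketch that assembles a `tesseract` command line. The helper name `build_tesseract_command` is made up for this example, and the accepted ranges (0 to 3 for `--oem`, 0 to 13 for `--psm`) assume Tesseract 4 or later.

```python
from typing import List

# Assumed ranges for Tesseract 4+: OEM 0-3, PSM 0-13.
VALID_OEM = range(0, 4)
VALID_PSM = range(0, 14)


def build_tesseract_command(image_path: str, oem: int = 3, psm: int = 3) -> List[str]:
    """Return the argument list for a tesseract call that prints text to stdout."""
    if oem not in VALID_OEM:
        raise ValueError(f"--oem must be between 0 and 3, got {oem}")
    if psm not in VALID_PSM:
        raise ValueError(f"--psm must be between 0 and 13, got {psm}")
    # "stdout" as the output base makes tesseract write the recognized text to standard output.
    return ["tesseract", image_path, "stdout", "--oem", str(oem), "--psm", str(psm)]


if __name__ == "__main__":
    # OEM 1 selects the LSTM engine; PSM 6 assumes a single uniform block of text.
    print(" ".join(build_tesseract_command("page.png", oem=1, psm=6)))
```

The resulting list can be handed to `subprocess.run(..., capture_output=True, text=True)` if the `tesseract` binary is installed and on your PATH.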
No, I don't think Tesseract supports such an option. What you can do is to train for one specific font and then specify that traineddata during recognition of your documents.
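Since you already know which font each document uses, the lookup can be as simple as a dictionary. This is only a sketch: the model names (`arial_custom`, `courier_custom`), the tessdata directory, and the helper name are hypothetical placeholders for whatever traineddata files you end up producing. The `-l` and `--tessdata-dir` options themselves are standard Tesseract command-line options.

```python
from typing import Dict, List

# Hypothetical mapping from a document's font to the traineddata trained for it.
# "arial_custom" would correspond to a file named arial_custom.traineddata.
FONT_TO_MODEL: Dict[str, str] = {
    "Arial": "arial_custom",
    "Courier": "courier_custom",
}


def command_for_font(image_path: str, font: str,
                     tessdata_dir: str = "/opt/tessdata_custom") -> List[str]:
    """Build a tesseract command that uses the traineddata matching the given font."""
    if font not in FONT_TO_MODEL:
        raise KeyError(f"No traineddata registered for font: {font}")
    model = FONT_TO_MODEL[font]
    # --tessdata-dir points at the folder holding the custom .traineddata files (assumed path).
    # -l selects the traineddata by name, without the .traineddata extension.
    return ["tesseract", image_path, "stdout", "--tessdata-dir", tessdata_dir, "-l", model]


if __name__ == "__main__":
    print(" ".join(command_for_font("invoice_001.png", "Arial")))
```

If you use a wrapper such as pytesseract, the same model name would go in its `lang` argument instead of `-l`.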