I'm using Tesseract to OCR text in a screen-scraper application. The only font used is Segoe UI 8 CLEARTYPE QUALITY (see image below). At the moment Tesseract is doing a poor job, confusing Z with 2, 0 with o, and so on.
I've tried scaling up the text image, with no improvement. Looking at eng.traineddata, I can see that Tesseract has not been trained on Segoe UI 8 CLEARTYPE QUALITY.
Question: How can I train tesseract with a new font and specify that only that font should be used?
Luckily, you can train Tesseract on your font so that it reads it more reliably.
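As a rough sketch of the first training step, Tesseract's training tools include a `text2image` program that renders a text file in a chosen font to produce training images and box files. The snippet below only builds that command line. The file names, output base, and fonts directory are hypothetical placeholders, and the flags should be checked against the training documentation for your installed version. Note also that `text2image` does its own rendering, so the samples will probably not reproduce ClearType screen rendering exactly.

```python
import shlex


def build_text2image_cmd(text_file, output_base, font_name, fonts_dir):
    """Build a text2image command that renders text_file in font_name.

    All argument values are supplied by the caller; nothing is executed here.
    """
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
    ]


# Hypothetical paths and names, for illustration only.
cmd = build_text2image_cmd(
    text_file="training_text.txt",
    output_base="eng.SegoeUI.exp0",
    font_name="Segoe UI",
    fonts_dir="C:/Windows/Fonts",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the training tools are installed. The remaining training steps depend on the Tesseract version and are not shown here.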
We can do this by supplying the --lang or -l command line argument, specifying the language we want Tesseract to use when OCR'ing. Here, I am OCR'ing a file named german.png where the -l parameter indicates that I want Tesseract to OCR German text (deu).
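A minimal sketch of that invocation, assuming the `tesseract` executable is on the PATH and the `deu` language data is installed. The helper name `build_tesseract_cmd` is made up for this example. Using `stdout` as the output base asks Tesseract to print the recognized text instead of writing a file. Once you have trained your own font, you would pass the name of your own traineddata file as `lang` instead.

```python
def build_tesseract_cmd(image_path, lang, output_base="stdout"):
    """Build a Tesseract command that OCRs image_path using language lang."""
    return ["tesseract", image_path, output_base, "-l", lang]


# OCR german.png as German text; the file name comes from the text above.
ocr_cmd = build_tesseract_cmd("german.png", "deu")
print(" ".join(ocr_cmd))
```

To actually run it, pass the list to `subprocess.run(ocr_cmd, capture_output=True, text=True)` and read the `stdout` attribute of the result.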
The optical character recognition (OCR) app trains the ocr function to recognize a custom language or font. You can use this app to label character data interactively for OCR training and to generate an OCR language data file for use with the ocr function.
Please provide an example of your effort. My goal is to help you reach your goal, not to do the work for you.
This is quite a common problem, and lots of people have solved it, some more efficiently than others. You can use the tools that they have created.
An example
There are several others; some of them handle only typefaces and are optimized for that, which may be more useful for you. For example:
There are other examples, but most of them use ImageMagick and other tools to improve the quality of the input data so that the OCR tool does its best. Personally, I wrote efficient C# GDI transformations to manipulate the input data before I run Tesseract on it.
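To illustrate the kind of preprocessing meant here, the sketch below is a dependency-free Python stand-in, not the C# GDI code mentioned above and not an ImageMagick pipeline. It assumes the image is already available as rows of `(r, g, b)` tuples. It converts to grayscale (which flattens the colored fringes that ClearType rendering tends to leave), upscales by an integer factor, and binarizes. The factor and threshold are arbitrary starting values to tune against your own screenshots.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def upscale_nearest(rows, factor):
    """Enlarge the image by an integer factor using nearest-neighbour copying."""
    out = []
    for row in rows:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(rows, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in rows]


def preprocess(rgb_rows, factor=3, threshold=128):
    """Grayscale, upscale, then binarize an RGB image given as nested lists."""
    return binarize(upscale_nearest(to_grayscale(rgb_rows), factor), threshold)


# Tiny two-pixel example: one black pixel next to one white pixel.
sample = [[(0, 0, 0), (255, 255, 255)]]
result = preprocess(sample, factor=2)
print(result)
```

In practice an imaging library would do this faster and with better interpolation. The point is only the order of operations applied before the image is handed to Tesseract.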