
Train tesseract to one specific font

Tags:

ocr

tesseract

I'm using Tesseract to OCR text from a screen-scraper application. The only font used is Segoe UI 8 CLEARTYPE QUALITY (see image below). At the moment Tesseract is doing a poor job, confusing Z with 2, 0 with o, and so on.

I've tried scaling up the text image (no improvement). Looking at eng.traineddata, I can see that Tesseract was not trained on Segoe UI 8 CLEARTYPE QUALITY.

Question: How can I train Tesseract on a new font and specify that only that font should be used?

enter image description here

Vingtoft asked Mar 12 '18 14:03





1 Answer

Please provide an example of your effort. My goal is to help you reach your goal, not to do the work for you.

This is quite a common problem, and lots of people have solved it, some more efficiently than others. You can use the tools they have created.

An example

  • code: https://github.com/ValYouW/ml-ocr-tool
  • video tutorial: https://www.youtube.com/watch?v=7uc05vyjVuw&t=631s

There are several others; some of them handle only typefaces and are optimized for that, which might be more useful in your case. For example:

  • https://www.youtube.com/watch?v=i_1-hGsXxy8

There are other examples, but most of them use ImageMagick and other tools to improve the quality of the input image so that the OCR tool can do its best. Personally, I wrote efficient C# GDI transformations to manipulate the input data before running Tesseract on it.
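
To make the preprocessing idea concrete, here is a minimal sketch in Python. It is not the C# GDI code mentioned above, only an illustration using the standard library. It converts an RGB image to grayscale (which should also flatten the colored subpixel fringes that ClearType rendering produces), enlarges it with nearest-neighbour scaling, and binarizes it with Otsu's threshold. The function names and the list-of-rows image representation are assumptions made for this example; in practice a tool such as ImageMagick would perform these steps, and the result would then be handed to Tesseract.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def upscale_nearest(gray_rows, factor):
    """Enlarge the image by an integer factor by repeating each pixel."""
    out = []
    for row in gray_rows:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def otsu_threshold(gray_rows):
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray_rows:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_b = 0      # number of pixels at or below the candidate threshold
    sum_b = 0    # sum of their intensities
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        var = w_b * w_f * (mean_b - mean_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray_rows, threshold):
    """Map pixels at or below the threshold to 0 (black), the rest to 255."""
    return [[0 if p <= threshold else 255 for p in row] for row in gray_rows]


def preprocess(rgb_rows, factor=3):
    """Grayscale, upscale, then binarize an RGB image given as rows of tuples."""
    gray = to_grayscale(rgb_rows)
    big = upscale_nearest(gray, factor)
    return binarize(big, otsu_threshold(big))
```

Whether this helps depends on the input. The question already reports that scaling alone made no difference, so the grayscale and thresholding steps are the parts worth experimenting with first.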

Margus answered Oct 18 '22 02:10
