Train tesseract to one specific font

Tags: ocr, tesseract

I'm using tesseract to OCR text from a screen-scraper application. The only font used is Segoe UI 8 CLEARTYPE QUALITY (see image below). At the moment tesseract is doing a poor job, confusing Z with 2, 0 with o, and so on.

I've tried scaling up the text image, with no improvement. Looking at eng.traineddata, I can see that tesseract is not trained with Segoe UI 8 CLEARTYPE QUALITY.

Question: How can I train tesseract with a new font and specify that only that font should be used?

enter image description here


asked Mar 12 '18 14:03

Vingtoft

1 Answer

Please provide an example of your effort. My goal is to help you reach your goal, not to do the work for you.

This is quite a common problem, and many people have solved it, some more efficiently than others. You can use the tools they have created.

An example:

Code: https://github.com/ValYouW/ml-ocr-tool
Video tutorial: https://www.youtube.com/watch?v=7uc05vyjVuw&t=631s

There are several other tools. Some of them handle only typefaces and are optimized for that, which may be more useful in your case. For example:

https://www.youtube.com/watch?v=i_1-hGsXxy8

There are other examples, but most of them use ImageMagick and other tools to improve the quality of the input data so that the OCR tool can do its best. Personally, I wrote efficient C# GDI transformations to manipulate the input data before running Tesseract on it.
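To make the preprocessing idea concrete, here is a minimal sketch in pure Python. It is not the C# GDI code mentioned above, and not what ImageMagick does internally. It assumes the screenshot is already available as rows of (R, G, B) tuples, and the function names, the scale factor, and the threshold of 128 are all illustrative choices. ClearType draws coloured fringes around glyphs, so the sketch first converts to grayscale, then enlarges the image, then forces every pixel to pure black or white.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def to_gray(img: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to luminance using integer BT.601 weights."""
    return [[(r * 299 + g * 587 + b * 114) // 1000 for r, g, b in row]
            for row in img]


def upscale(gray: List[List[int]], factor: int) -> List[List[int]]:
    """Nearest-neighbour upscale: repeat each pixel `factor` times per axis."""
    out: List[List[int]] = []
    for row in gray:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map every pixel to 0 (black) or 255 (white) around a fixed threshold."""
    return [[0 if v < threshold else 255 for v in row] for row in gray]


def preprocess(img: List[List[Pixel]], factor: int = 3,
               threshold: int = 128) -> List[List[int]]:
    """Grayscale, enlarge, then binarize (hypothetical pipeline order)."""
    return binarize(upscale(to_gray(img), factor), threshold)
```

In practice you would load and save the pixels with an image library and hand the result to Tesseract. A smoother interpolation than nearest-neighbour may work better for small text, and the threshold will likely need tuning against your own screenshots.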

answered Oct 18 '22 02:10

Margus
