Creating a training image for Tesseract OCR

1 Answer

The second question is partly answered at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images which states: "There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)"
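To get a feel for where that 15 pixel threshold falls, here is a rough sketch that converts a point size and resolution into pixels. It relies on 1 point being 1/72 inch. The helper names em_px and x_height_px are made up for this example, and the default x-height ratio of 0.5 is only an assumption: the real ratio depends on the font.

```shell
# Em height in pixels: points * dpi / 72 (1 point = 1/72 inch).
em_px() {
    awk -v pt="$1" -v dpi="$2" 'BEGIN { printf "%.1f\n", pt * dpi / 72 }'
}

# Approximate x-height in pixels. The third argument is the font's
# x-height-to-em ratio; 0.5 is an assumed default, not a measured value.
x_height_px() {
    awk -v pt="$1" -v dpi="$2" -v ratio="${3:-0.5}" \
        'BEGIN { printf "%.1f\n", pt * dpi / 72 * ratio }'
}

em_px 10 300        # em height of 10 point text at 300 dpi
x_height_px 10 300  # estimated x-height at 300 dpi
x_height_px 10 150  # estimated x-height at 150 dpi
```

Under that assumed ratio, 10 point text comes out at roughly 21 pixels of x-height at 300 dpi and roughly 10 pixels at 150 dpi, so measure your actual font before relying on these numbers.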

Questions 1 and 3: in my experience, 300 dpi images with non-anti-aliased fonts work well. Specifically, I ran convert with the following parameters on a training PDF, and it produced a satisfactory image:

convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif

However, when I tried to add a dotted font to Tesseract, it only detected characters properly when I used a 150 dpi image. So I don't think there is a general solution; it depends on the kind of font you are trying to add.
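Since the best resolution seems to depend on the font, one way to compare is to render the same training PDF at several densities and try each result. The sketch below only prints the convert commands (a dry run) using the same parameters as above. The function name render_cmd and the file names training.pdf and training_150dpi.tif / training_300dpi.tif are placeholders, not anything Tesseract requires.

```shell
# Print the convert command for one density; nothing is executed here.
# Arguments: density in dpi, input PDF, output TIFF.
render_cmd() {
    printf 'convert -density %s -depth 8 %s -background white -flatten +matte -compress none -monochrome %s\n' \
        "$1" "$2" "$3"
}

# Dry run: one command per candidate resolution.
for dpi in 150 300; do
    render_cmd "$dpi" training.pdf "training_${dpi}dpi.tif"
done
```

Once the printed commands look right, they can be run by hand or by piping the loop's output to sh. That shortcut assumes file names without spaces or shell metacharacters.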

189

answered Sep 19 '22 01:09

Luiza Utsch


Tags:

ocr

tesseract

sashoalm
