Creating a training image for Tesseract OCR

Tags: ocr, tesseract

I'm writing a generator for training images for Tesseract OCR.

When generating a training image for a new font for Tesseract OCR, what are the best values for:

  1. The DPI
  2. The font size in points
  3. Should the font be anti-aliased or not?
  4. Should the bounding boxes fit snugly (image omitted) or not (image omitted)?
sashoalm asked Nov 16 '12 10:11



1 Answer

The second question is partly answered in the Tesseract training wiki (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images): "There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)"
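To relate the 15-pixel x-height threshold to a point size and DPI, here is a rough sketch of the arithmetic. It relies on one point being 1/72 inch; the x-height ratio of 0.5 is only an assumption (it varies from font to font), and the function names are made up for illustration.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def em_height_px(font_size_pt, dpi):
    """Nominal font size (em height) in pixels at the given resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


def approx_x_height_px(font_size_pt, dpi, x_height_ratio=0.5):
    """Rough x-height in pixels.

    x_height_ratio is an assumed value; measure it for your font
    if the result is close to the threshold.
    """
    return em_height_px(font_size_pt, dpi) * x_height_ratio


# Compare a 10 pt font at two resolutions against the ~15 px threshold.
for dpi in (150, 300):
    x_height = approx_x_height_px(10, dpi)
    status = "above" if x_height >= 15 else "below"
    print(f"{dpi} dpi: x-height ~{x_height:.1f} px ({status} the 15 px threshold)")
```

Under that assumed ratio, 10 point text lands above the threshold at 300 dpi and below it at 150 dpi, so it is worth checking the real x-height of your font before settling on a resolution.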

Questions 1 and 3: in my experience, 300 dpi images with non-anti-aliased fonts work well. Specifically, I used the following ImageMagick convert parameters on a training PDF, which produced a satisfactory image:

convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif

However, when I tried to add a dotted font to Tesseract, it only detected characters properly when I used a 150 dpi image. So I don't think there is a general solution; it depends on the kind of font you're trying to add.
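Since the best density seems to depend on the font, it may help to render the same training PDF at several densities and compare the results. The sketch below only builds the argument list for the convert command shown above and prints it; the file names and the helper name are hypothetical, and nothing is executed.

```python
import shlex


def build_convert_args(input_pdf, output_tif, density):
    """Mirror the convert command from the answer, varying only the density."""
    return [
        "convert",
        "-density", str(density),
        "-depth", "8",
        input_pdf,
        "-background", "white",
        "-flatten",
        "+matte",
        "-compress", "none",
        "-monochrome",
        output_tif,
    ]


# Hypothetical file names; print one command per candidate density.
for density in (150, 300):
    args = build_convert_args("training.pdf", f"training_{density}.tif", density)
    print(shlex.join(args))
```

To actually run a command, the list can be passed to subprocess.run(args, check=True), assuming ImageMagick is installed and convert is on the PATH.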

Luiza Utsch answered Sep 19 '22 01:09