How do I train Tesseract 4 with image data instead of a font file?

I'm trying to train Tesseract 4 with images instead of fonts.

The docs explain only the font-based approach, not training from images.

I know how this works in earlier versions of Tesseract, but I can't figure out how to use the box/tiff files to train the LSTM engine in Tesseract 4.

I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?

asked Apr 11 '17 17:04



1 Answer

Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.

You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best, which serves as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from an existing model lets you fine-tune with far less data (tens to hundreds of samples can be enough).

Add your training samples to the directory ./tesstrain/data/my-custom-model-ground-truth in the tesstrain repo. The directory name must be your model name followed by -ground-truth.

Your training samples should be image/text file pairs that share the same name but different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar and you should have a text file named 001.gt.txt that has the text foobar.

Each image, and its matching text file, must contain a single line of text.
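Before starting a training run, it can help to sanity-check the ground-truth directory. The following is a minimal sketch, not part of tesstrain: check_ground_truth is a hypothetical helper name, and it assumes the images are PNG files. It reports any image without a matching .gt.txt file and any transcription that does not contain exactly one non-empty line.

```bash
# Hypothetical helper (not part of tesstrain): verify that every PNG in a
# ground-truth directory has a matching .gt.txt file with exactly one
# non-empty line. Assumes images use the .png extension.
check_ground_truth() {
    local dir="$1"
    local problems=0
    local img gt lines
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue              # glob matched nothing
        gt="${img%.png}.gt.txt"
        if [ ! -f "$gt" ]; then
            echo "missing transcription: $gt"
            problems=$((problems + 1))
            continue
        fi
        # grep -c prints the number of non-empty lines (and exits 1 on zero).
        lines=$(grep -c . "$gt" || true)
        if [ "$lines" -ne 1 ]; then
            echo "expected 1 line of text, found $lines: $gt"
            problems=$((problems + 1))
        fi
    done
    # Succeed only if no problems were found.
    [ "$problems" -eq 0 ]
}
```

From the tesstrain repo, you could run it as check_ground_truth data/my-custom-model-ground-truth and start training only if it succeeds.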

In the tesstrain repo, run this command:

make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best

Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.

Then, you can run tesseract and use that model as a language.

tesseract -l my-custom-model foo.png -
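If tesseract complains that it cannot load the language, the model is probably not in the directory it searches. One way to check is to look for the model name in the output of tesseract --list-langs. The sketch below assumes that output is a header line followed by one language name per line; model_listed is a hypothetical helper name, not a Tesseract command.

```bash
# Hypothetical helper: read `tesseract --list-langs` output on stdin and
# succeed if the given model name appears as a whole line after the header.
model_listed() {
    # tail skips the header line; grep -x matches the whole line,
    # -F treats the name as a literal string, -q suppresses output.
    tail -n +2 | grep -qxF -- "$1"
}

# Usage (requires tesseract on PATH):
#   tesseract --list-langs 2>&1 | model_listed my-custom-model && echo "model found"
```

Matching the whole line avoids false positives when one model name is a prefix of another.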

Eric Ihli answered Oct 17 '22 20:10