I'm trying to train Tesseract 4 with images instead of fonts.
The docs only explain the approach with fonts, not with images.
I know how it works with an earlier version of Tesseract, but I don't understand how to use the box/tiff files to train the LSTM engine in Tesseract 4.
I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?
After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python (or, I think, any other language), pass lang = "Font" as the second parameter of the image_to_string function. This improves accuracy significantly, but it can still make mistakes, of course.
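As a rough illustration of the image_to_string call described above, here is a minimal sketch. It assumes the pytesseract and Pillow packages are installed and that a file named Font.traineddata sits in your tessdata folder. The helper name read_with_custom_font is made up for this example.

```python
def read_with_custom_font(image_path, lang="Font"):
    """OCR one image using the custom traineddata named by `lang`."""
    # Imported inside the function so the file still loads where the
    # third-party packages (pytesseract, Pillow) are not installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        # `lang` must match the traineddata file name without its extension.
        return pytesseract.image_to_string(image, lang=lang)
```

Calling read_with_custom_font("sample.png") would then return the recognized text as a string, where sample.png stands for any image of your own.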
Luckily, you can train your Tesseract so it can read your font easily.
Go to this Tesseract repository and download the 32-bit or 64-bit .exe installer that matches your system. Install it to a path such as “C:\Program Files\Tesseract-OCR”. Then go to your settings and add this path to your PATH environment variable. Finally, open a command prompt and type “tesseract.exe” to verify the installation.
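If you would rather check the installation from a script than from the command prompt, something along these lines should work. It is only a sketch: it assumes the executable is reachable through PATH, and the function name tesseract_version is invented for this example.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None  # not installed, or the install folder is missing from PATH
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds print the version to stderr, so fall back to that stream.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on PATH")
```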
Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from an existing model lets you fine-tune with much less data (tens to hundreds of samples can be enough).
Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth
Your training samples should be image/text file pairs that share the same name but different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar, and you should have a text file named 001.gt.txt that has the text foobar.
These files need to be single lines of text.
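Before starting a training run it can save time to check that every image really has a matching single-line transcription. The script below is a minimal sketch of such a check, not part of tesstrain. The function name and the set of image extensions are assumptions, so adjust them to your own data.

```python
from pathlib import Path

# Assumption: samples are plain .png/.tif images such as 001.png.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}


def check_ground_truth(directory):
    """Return a list of problems found in a ground-truth folder."""
    problems = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # 001.png is expected to be paired with 001.gt.txt
        transcript = image.with_name(image.stem + ".gt.txt")
        if not transcript.exists():
            problems.append(f"{image.name}: missing {transcript.name}")
            continue
        text = transcript.read_text(encoding="utf-8").rstrip("\r\n")
        if not text.strip():
            problems.append(f"{transcript.name}: empty transcription")
        elif "\n" in text:
            problems.append(f"{transcript.name}: more than one line of text")
    return problems


if __name__ == "__main__":
    for problem in check_ground_truth("data/my-custom-model-ground-truth"):
        print(problem)
```

An empty result means every image found has a non-empty, single-line .gt.txt file next to it.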
In the tesstrain repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
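The copy step can also be scripted. This is a small sketch using only the standard library. The function name install_model is hypothetical, and both paths are supplied by the caller, since the tessdata directory differs between machines.

```python
import shutil
from pathlib import Path


def install_model(traineddata_file, tessdata_dir):
    """Copy a .traineddata file into the given tessdata directory."""
    source = Path(traineddata_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"{source.name} is not a .traineddata file")
    target_dir = Path(tessdata_dir)
    if not target_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {target_dir}")
    destination = target_dir / source.name
    shutil.copy2(source, destination)  # keeps the file name, which is the model name
    return destination
```

On a system-wide install, writing to a directory such as /usr/local/share/tessdata/ will usually require elevated permissions.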
Then, you can run tesseract and use that model as a language.
tesseract -l my-custom-model foo.png -
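The same command can be driven from Python through subprocess, which avoids any extra bindings. The sketch below assumes tesseract is on PATH and that the model has already been copied into the tessdata directory. The names build_command and recognize are invented for this example.

```python
import subprocess


def build_command(image_path, model="my-custom-model"):
    """Build the argument list for the tesseract call shown above."""
    # The trailing "-" tells tesseract to write the text to stdout.
    return ["tesseract", "-l", model, str(image_path), "-"]


def recognize(image_path, model="my-custom-model"):
    """Run tesseract on one image and return the recognized text."""
    completed = subprocess.run(
        build_command(image_path, model),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return completed.stdout.strip()
```

Passing the arguments as a list rather than a single string sidesteps shell quoting problems with file names that contain spaces.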