I have some questions about making tiff/box files for Tesseract 4. The TrainingTesseract 4.00 documentation says:
"Making Box Files: As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some pre-existing images (like ancient manuscripts for example)."
But it does not explain how to train with pre-existing images.
I want to train Tesseract 4 (LSTM) for the Persian language. I have some images of ancient manuscripts and want to train with images and their texts instead of fonts, so I can't use the text2image command. I know that the old-format box files will not work for LSTM training.
The DS team is tasked with training a model for Tesseract, an open-source OCR engine, as an alternative to Google Vision. Tesseract training takes segmented handwritten images and their corresponding transcribed texts (ground truth). Each pair needs to share the same name: <name>.tif for the image or <name>.
Go to this tesseract repository and download the 32-bit or 64-bit .exe installer that matches your system. Install it in a location such as "C:\Program Files\Tesseract-OCR". Then open your system settings and add that folder to your PATH environment variable. Finally, open a command prompt and type "tesseract.exe" to verify the installation.
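The verification step can also be scripted. The following is only a minimal sketch: the function name tesseract_version is made up for illustration, and it assumes the executable is reachable through PATH under the name "tesseract". It relies on the standard --version flag of the tesseract command.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line printed by `<executable> --version`.

    Returns None if the executable is not on PATH or prints nothing.
    The function name and default argument are illustrative assumptions.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("tesseract was not found on PATH")
    else:
        print(version)
```

If this reports that tesseract was not found, the install folder is probably missing from PATH, or the command prompt was opened before the variable was changed.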
I was struggling just like you, until I found this GitHub repository: https://github.com/OCR-D/ocrd-train
It will make your life much easier. All you need to do is provide your images in tif format, each with a text file that has the same name as the image and the extension .gt.txt. The repository takes care of the rest for you. (You might need to update the Makefile to match your local machine.)
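Before running the training, it can help to check that every image really has a matching transcription. Here is a small sketch of such a check. The function name find_unpaired and the folder name in the usage line are hypothetical and not part of ocrd-train; the only thing taken from the description above is the <name>.tif / <name>.gt.txt naming rule.

```python
from pathlib import Path

GT_SUFFIX = ".gt.txt"


def find_unpaired(data_dir):
    """Return (images_without_text, texts_without_image) for one folder.

    Pairs are matched by base name: "page1.tif" goes with "page1.gt.txt".
    This helper is an illustrative assumption, not part of ocrd-train.
    """
    root = Path(data_dir)
    # "page1.tif" -> "page1"
    image_names = {p.stem for p in root.glob("*.tif")}
    # "page1.gt.txt" -> "page1"
    text_names = {p.name[: -len(GT_SUFFIX)] for p in root.glob("*" + GT_SUFFIX)}
    missing_text = sorted(image_names - text_names)
    missing_image = sorted(text_names - image_names)
    return missing_text, missing_image
```

For example, `find_unpaired("my-ground-truth")` (a made-up folder name) returns two lists: the image names that have no .gt.txt file, and the .gt.txt names that have no image. Both lists should be empty before you start training. For Persian text it is also worth making sure the .gt.txt files are saved as UTF-8.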
Whether to train from scratch or fine-tune depends on your language, your data, and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance and only want to build on it.
All the useful details you might need can be found in this answer