Training Tesseract 4 with images instead of fonts

Tags:

tesseract

I have some questions about making tif/box files for Tesseract 4. The TrainingTesseract 4.00 document says:

"Making Box Files: As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some pre-existing images (like ancient manuscripts for example)."

But it does not explain how to train with pre-existing images.

I want to train Tesseract 4 (LSTM) for the Persian language. I have some images of ancient manuscripts and want to train with images and their transcriptions instead of fonts, so I can't use the text2image command. I know that the old-format box files will not work for LSTM training.

How can I make tif/box files for Tesseract 4 LSTM and label them, and how do I need to change the Tesseract commands?
Should I use other tools for generating box files (given that Persian is written right to left)?
Should I fine-tune or train from scratch?

asked Jun 28 '18 10:06

M.Rahnama

1 Answer

I was struggling just like you, until I found this github repository: https://github.com/OCR-D/ocrd-train

It will make your life much easier. All you need to do is provide your images in tif format, each with a transcription text file that has the same base name as the image and the extension .gt.txt. The repository takes care of the rest for you. (You might need to update the Makefile to match your local machine.)
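Before running the Makefile, it can help to confirm that every image really has a matching transcription. The following is only a minimal sketch of such a check: the function name `check_ground_truth` and the example folder path are hypothetical, and it assumes all image/transcription pairs sit together in one folder, which may not match the layout your copy of the repository expects.

```python
from pathlib import Path


def check_ground_truth(folder):
    """Split the .tif images in `folder` into paired and unpaired ones.

    An image counts as paired when a non-empty file with the same base
    name and the extension .gt.txt exists next to it (for example
    page_001.tif and page_001.gt.txt).
    """
    folder = Path(folder)
    paired, missing = [], []
    for image in sorted(folder.glob("*.tif")):
        transcript = image.with_name(image.stem + ".gt.txt")
        # Read as UTF-8 so that Persian text is decoded correctly.
        if transcript.is_file() and transcript.read_text(encoding="utf-8").strip():
            paired.append(image.name)
        else:
            missing.append(image.name)
    return paired, missing


if __name__ == "__main__":
    # Hypothetical folder name; point this at your own data.
    ok, todo = check_ground_truth("data/ground-truth")
    print(f"{len(ok)} images ready, {len(todo)} still need a .gt.txt file")
```

The check only compares file names and tests for non-empty text; it does not verify that a transcription actually matches what is on the image.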

Whether to train from scratch or fine-tune depends on your language, your data and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance and only want to build on it.
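If you do fine-tune by hand rather than through the Makefile, the Tesseract 4 training documentation describes two steps: extracting the LSTM model from an existing traineddata file with `combine_tessdata -e`, then continuing training from it with `lstmtraining --continue_from`. The sketch below only assembles those two command lines and does not run anything. The helper name `fine_tune_commands`, all paths, the output naming and the iteration count are assumptions for illustration, and `fas` is used as the Tesseract language code for Persian.

```python
def fine_tune_commands(lang, tessdata_dir, list_file, output_dir, max_iterations=400):
    """Build the two command lines for fine-tuning an existing model.

    Returns (extract_cmd, train_cmd) as argument lists, suitable for
    subprocess.run(cmd, check=True). Nothing is executed here.
    """
    base_model = f"{tessdata_dir}/{lang}.traineddata"
    extracted = f"{output_dir}/{lang}.lstm"

    # Step 1: pull the LSTM network out of the existing traineddata file.
    extract_cmd = ["combine_tessdata", "-e", base_model, extracted]

    # Step 2: continue training that network on the new data.
    # list_file is assumed to list the .lstmf files generated for your images.
    train_cmd = [
        "lstmtraining",
        "--continue_from", extracted,
        "--traineddata", base_model,
        "--train_listfile", list_file,
        "--model_output", f"{output_dir}/{lang}_finetuned",
        "--max_iterations", str(max_iterations),
    ]
    return extract_cmd, train_cmd


if __name__ == "__main__":
    # Hypothetical paths; adjust to your own setup.
    for cmd in fine_tune_commands("fas", "tessdata_best", "data/list.train", "output"):
        print(" ".join(cmd))
```

Keeping the commands as argument lists avoids shell-quoting problems with paths that contain spaces. Training from scratch would use a different `lstmtraining` invocation and is not covered by this sketch.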

All the useful details you might need can be found in this answer

answered Oct 25 '22 11:10

Raniem
