 

Training tesseract 4 with images instead of font

Tags:

tesseract

I have some questions about making tiff/box files for Tesseract 4. The TrainingTesseract 4.00 document says:

Making Box Files

"As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some pre-existing images (like ancient manuscripts for example)."

But it does not explain how to train with pre-existing images.

I want to train Tesseract 4 (LSTM) for the Persian language. I have some images from ancient manuscripts and want to train with images and their transcriptions instead of fonts, so I can’t use the text2image command. I know that the old-format box files will not work for LSTM training.

  1. How can I make tif/box files for Tesseract 4 LSTM and label them, and how should the tesseract commands change?
  2. Should I use other tools for generating box files (given that Persian is written right to left)?
  3. Should I fine-tune or train from scratch?
M.Rahnama asked Jun 28 '18 10:06





1 Answer

I was struggling just like you, until I found this GitHub repository: https://github.com/OCR-D/ocrd-train

It makes this much easier. All you need to do is provide your images in tif format, each with a transcription file that has the same base name as the image and the extension .gt.txt. The repository takes care of the rest for you. (You might need to update the Makefile to match your local machine.)
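Before starting a training run, it can help to confirm that every image really has a matching transcription. The following is a minimal sketch of such a check, not something the repository provides: the function name find_unpaired is hypothetical, and the only thing it relies on is the naming convention described above (name.tif next to name.gt.txt in one directory).

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Compare *.tif images with *.gt.txt transcriptions in one directory.

    Returns a tuple of two sorted lists of base names:
    (images without a transcription, transcriptions without an image).
    This helper is illustrative only and is not part of ocrd-train.
    """
    root = Path(ground_truth_dir)
    # Strip the full suffix by hand, since ".gt.txt" is a double extension.
    image_names = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    text_names = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    missing_text = sorted(image_names - text_names)
    missing_image = sorted(text_names - image_names)
    return missing_text, missing_image
```

Call it with the directory that holds your pairs, for example find_unpaired("path/to/your/ground-truth") (the path is a placeholder). Two empty lists mean every image has a transcription and vice versa.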

Whether to train from scratch or fine-tune depends on your language, your data, and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance and only need to build on it.

All the useful details you might need can be found in this answer

Raniem answered Oct 25 '22 11:10
