Tesseract OCR Engine Cube mode - Training Tesseract

Can you explain what Cube mode and Cube data files are in the Tesseract OCR engine, and what the advantage of using them is?

And how can I train Tesseract for Greek to get better results?

George Melidis asked May 16 '13 14:05


3 Answers

For those who might still be interested: Tesseract's website offers standard trained data sets for different languages.

https://code.google.com/p/tesseract-ocr/downloads/list?num=100&start=100

The training procedure is described here (for version 3.01):

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

As for Cube, it is a separate recognition engine, an alternative to the standard Tesseract one. It consumes more resources and is slower, but gives better results.

The Cube data files are a set of files that are ultimately merged into a single trained data file.

Siarhei Yakushevich answered Sep 27 '22 02:09


There is an explanation of the various training files required by the Cube engine mode on the tesseract-ocr-extradocs project wiki:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube

There you can find detailed (but incomplete) information on how to create the necessary files for training in Cube mode. There's also some information on the neural network file format that might be useful:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/nnFileFormat

Cube mode will often give you better recognition results by using neural networks instead of the adaptive classifier.

I have never created Cube training files myself, so I can't give you more detailed information on how to create them.

pvorb answered Sep 23 '22 02:09


For Tesseract 4+ (with LSTM)

I'm not completely sure about Cube mode, but with --oem 1 you can enable the new LSTM engine and use either of the following approaches:

  • Use the existing models

    I would recommend using the pre-trained models available on the Tesseract GitHub repo. They've got a wide variety of languages (and it looks like Greek is supported too!)

  • Train it yourself

    I haven't tried this myself, but the relevant wiki on GitHub looks solid.
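
For the first option, here is a minimal sketch of calling the LSTM engine from Python. It only assembles the command line (`-l` for the language, `--oem 1` for LSTM, `--tessdata-dir` for a custom model directory). The function name and `page.png` are made up for illustration, and `ell` is the code the tessdata repo uses for modern Greek.

```python
from typing import List, Optional


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: str = "ell",          # "ell" = modern Greek in the tessdata repo
    oem: int = 1,               # 1 = LSTM engine in Tesseract 4+
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the Tesseract 4+ command-line tool.

    Hypothetical helper: it does not run anything, it only returns the
    arguments so they can be passed to subprocess.run().
    """
    cmd = ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Point Tesseract at a project-local directory of .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


print(" ".join(build_tesseract_command("page.png", "page")))
# prints: tesseract page.png page -l ell --oem 1
```

To actually run it, something like `subprocess.run(build_tesseract_command("page.png", "page"), check=True)` should write the recognized text to `page.txt`, assuming Tesseract 4+ and the Greek model are installed.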

tl;dr

  • git clone git@github.com:tesseract-ocr/tessdata.git
  • Select the language file you want
  • Move it into your project's tessdata directory
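
The last two steps can be sketched in Python roughly like this, assuming the repo has already been cloned. The function name and the directory names in the usage line are placeholders, not anything Tesseract prescribes. The sketch copies rather than moves the file so the clone stays intact.

```python
import shutil
from pathlib import Path


def install_language_model(repo_dir: str, lang: str, tessdata_dir: str) -> Path:
    """Copy <lang>.traineddata from a cloned tessdata repo into a project.

    Hypothetical helper; returns the path of the copied file.
    """
    source = Path(repo_dir) / f"{lang}.traineddata"
    if not source.is_file():
        raise FileNotFoundError(f"No model for language '{lang}' in {repo_dir}")

    target_dir = Path(tessdata_dir)
    # Create the project's tessdata directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)

    # copy2 also preserves file metadata such as modification time.
    return Path(shutil.copy2(source, target_dir / source.name))
```

For Greek that would be something like `install_language_model("tessdata", "ell", "my_project/tessdata")`.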
Pranav answered Sep 26 '22 02:09