Tesseract OCR Engine Cube mode - Training Tesseract

Question

Can you explain what Cube mode and the Cube data files are in the Tesseract OCR engine, and what the advantage of using them is?

And how can I train Tesseract for Greek to get better results?

Siarhei Yakushevich · Accepted Answer

For those who might still be interested: the Tesseract website provides standard trained data sets for different languages.

https://code.google.com/p/tesseract-ocr/downloads/list?num=100&start=100

The training procedure is described here (for version 3.01):

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Cube is a separate recognition engine that can be used instead of the standard Tesseract one. It consumes more resources and is slower, but gives better results.

The Cube data files are a set of files that are ultimately merged into a single trained data file.

pvorb · Answer

There is an explanation of the various training files required by the Cube engine mode on the tesseract-ocr-extradocs project wiki:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube

There you can find detailed (but incomplete) information on how to create the necessary files for training in Cube mode. There's also some information on the neural network file format that might be useful:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/nnFileFormat

Cube mode will often give you better recognition results by using neural networks instead of the adaptive classifier.

I have never created Cube training files myself, so I can't give you more detailed information on how to create them.

Pranav · Answer

For Tesseract 4+ (with LSTM)

I'm not completely sure about Cube mode, but with --oem 1 you can enable the new LSTM engine and take advantage of the following options:

Use the existing models

I would recommend using the pre-trained models available on the Tesseract GitHub repo. They cover a wide variety of languages (and it looks like Greek is supported too!).
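
As a rough illustration of combining --oem 1 with a pre-trained Greek model, the command line could be assembled as below. This is only a sketch: the helper name build_tesseract_command, the file names, and the tessdata path are made up for the example, and it assumes the Greek model is the one published as ell.traineddata.

```python
from typing import List, Optional


def build_tesseract_command(image: str, output_base: str, lang: str = "ell",
                            tessdata_dir: Optional[str] = None) -> List[str]:
    """Build an argv list for the tesseract CLI with the LSTM engine (--oem 1).

    'ell' is assumed here to be the language code of the Greek model.
    """
    cmd = ["tesseract", image, output_base, "-l", lang, "--oem", "1"]
    if tessdata_dir is not None:
        # Point tesseract at a project-local directory of .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run(cmd, check=True)
    # to execute it, which requires Tesseract 4+ on PATH.
    print(" ".join(build_tesseract_command("page.png", "page",
                                           tessdata_dir="./tessdata")))
```

Building the arguments as a list rather than a single string avoids shell quoting problems when image paths contain spaces.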

Train it yourself

I haven't tried this myself, but the relevant wiki on GitHub looks solid.

tl;dr

1. git clone git@github.com:tesseract-ocr/tessdata.git
2. Select the language file you want (for Greek, ell.traineddata).
3. Move it into your project's tessdata directory.
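
The last two steps could be scripted roughly as follows. This is a sketch under assumptions: install_traineddata is a hypothetical helper, the directory names are placeholders, and it copies the file rather than moving it so that the cloned repo stays intact.

```python
import shutil
from pathlib import Path


def install_traineddata(repo_dir: Path, lang: str, tessdata_dir: Path) -> Path:
    """Copy <lang>.traineddata from a tessdata checkout into tessdata_dir."""
    source = repo_dir / f"{lang}.traineddata"
    if not source.is_file():
        raise FileNotFoundError(f"no model for {lang!r} in {repo_dir}")
    # Create the project's tessdata directory if it does not exist yet.
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    target = tessdata_dir / source.name
    shutil.copy2(source, target)
    return target
```

For example, after cloning the repo into ./tessdata, calling install_traineddata(Path("tessdata"), "ell", Path("my_project/tessdata")) would place the Greek model where the project expects it (my_project is a placeholder path).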
