Can you explain what Cube mode and Cube data files are in the Tesseract OCR engine, and what the advantage of using them is?
And how can I train Tesseract for Greek to get better results?
For those who might still be interested: on Tesseract's website, there are standard trained data sets for different languages.
https://code.google.com/p/tesseract-ocr/downloads/list?num=100&start=100
Procedure for training is described here (for version 3.01)
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
Cube is a separate recognition engine that ships alongside the standard Tesseract engine. It consumes more resources and is slower, but gives better results.
The Cube data files are the set of files that are ultimately merged into a single trained data file.
There is an explanation of the various training files required by the Cube engine mode on the tesseract-ocr-extradocs project wiki:
https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube
There you can find detailed (but incomplete) information on how to create the necessary files for training in Cube mode. There's also some information on the neural network file format that might be useful:
https://code.google.com/p/tesseract-ocr-extradocs/wiki/nnFileFormat
Cube mode will often give you better recognition results by using neural networks instead of the adaptive classifier.
I never created Cube training files on my own, so I can't give you more detailed information on how to create these files.
I'm not completely sure about Cube mode, but with --oem 1 you can enable the newer LSTM engine (Tesseract 4 and later) and take advantage of the following:
I would recommend using the pre-trained models available in the Tesseract GitHub repo. They've got a wide variety of languages (and it looks like Greek is supported too!)
I haven't tried this myself but the relevant Wiki on GitHub looks solid.
git clone git@github.com:tesseract-ocr/tessdata.git
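Once the models are cloned, a run with the Greek model and the LSTM engine could look roughly like the sketch below. It only assembles the command line and runs it if Tesseract and the image are actually present. The helper name build_tesseract_command and the file names page.png and page_out are made up for illustration; ell is the language code Tesseract uses for Greek (ell.traineddata), and --oem 1 assumes Tesseract 4 or later.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="ell", oem=1, tessdata_dir=None):
    """Return the argv list for a Tesseract run (nothing is executed here).

    Hypothetical helper: lang="ell" selects Greek, oem=1 selects the LSTM engine.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if tessdata_dir is not None:
        # Point Tesseract at the cloned tessdata folder instead of the system one.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    cmd += ["-l", lang, "--oem", str(oem)]
    return cmd


if __name__ == "__main__":
    # page.png and page_out are placeholder names, not files from this thread.
    cmd = build_tesseract_command("page.png", "page_out", tessdata_dir="tessdata")
    print(" ".join(cmd))
    # Only run when the tesseract binary and the input image both exist.
    if shutil.which("tesseract") and Path("page.png").exists():
        subprocess.run(cmd, check=True)
```

With these placeholder names, Tesseract writes the recognised text to page_out.txt. Dropping tessdata_dir makes it fall back to whatever trained data is installed system-wide.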