Sorry, this is probably a dumb question, but I am fairly new to machine learning and Tesseract OCR. I have heard that Tesseract OCR can be trained.
What I need to know is: does Tesseract OCR use neural networks as its default training mechanism, or do we have to program it explicitly to use neural networks?
Sorry if I'm thinking about this "training" concept in the wrong way. What I need to know exactly is whether Tesseract is already using NNs and, if not, how I can approach using NNs with Tesseract OCR to improve recognition accuracy.
If anyone can suggest some good resources or a way to get started, that would be a great help too.
What I currently know: the basic machine learning concept of supervised training, and how to perform basic image OCR operations in Tesseract OCR.
The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.
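As a rough illustration of the single-textline case described above, here is a minimal sketch using the `pytesseract` wrapper and Pillow. Both are assumptions, since the text above does not name a binding. `--oem 1` selects the LSTM engine and `--psm 7` tells Tesseract to treat the image as a single text line. The helper names `build_config` and `recognize_line` are hypothetical.

```python
def build_config(oem: int = 1, psm: int = 7) -> str:
    """Build a Tesseract command-line config string.

    oem 1 selects the LSTM (neural network) engine only;
    psm 7 treats the input image as a single line of text.
    """
    return f"--oem {oem} --psm {psm}"


def recognize_line(image_path: str, lang: str = "eng") -> str:
    """Run OCR on an image that contains one line of text."""
    # Imported lazily so this module still loads on machines
    # where pytesseract / Pillow are not installed (assumed dependencies).
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image, lang=lang, config=build_config())
    return text.strip()
```

Calling `recognize_line("line.png")` (a hypothetical file) would return the recognized text, provided the Tesseract binary and the `eng` language data are installed. For a full page, `build_config(psm=3)` would hand the work back to Tesseract's own layout analysis.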
The line finding algorithm is one of the few parts of Tesseract that has previously been published [3]. The line finding algorithm is designed so that a skewed page can be recognized without having to de-skew, thus saving loss of image quality.
Make a starter traineddata from the unicharset and optional dictionary data. Run tesseract to process image + box file to make training data set. Run training on training data set. Combine data files.
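The four steps above can be sketched as a list of command lines. This is only an outline under assumptions: the directory layout, file names, and the `training_commands` helper are hypothetical, the network specification is left to the caller, and the flags are the ones I recall from the Tesseract 4 training tools, so check each tool's `--help` before relying on them. Nothing is executed here; the function only builds the argument lists.

```python
from typing import List


def training_commands(lang: str, image_base: str, out_dir: str,
                      net_spec: str, max_iterations: int = 400) -> List[List[str]]:
    """Return the four training steps as argument lists (nothing is run)."""
    # combine_lang_model writes <out_dir>/<lang>/<lang>.traineddata
    starter = f"{out_dir}/{lang}/{lang}.traineddata"
    # lstmtraining writes checkpoints using this prefix (hypothetical path)
    checkpoint_prefix = f"{out_dir}/checkpoints/{lang}"
    return [
        # 1. Make a starter traineddata from the unicharset
        #    (dictionary data is optional and omitted here).
        ["combine_lang_model",
         "--input_unicharset", f"{out_dir}/unicharset",
         "--script_dir", "langdata",  # hypothetical langdata checkout
         "--output_dir", out_dir,
         "--lang", lang],
        # 2. Process image + box file into a training data file (.lstmf).
        ["tesseract", f"{image_base}.tif", image_base,
         "--psm", "6", "lstm.train"],
        # 3. Run training on the training data set.
        ["lstmtraining",
         "--traineddata", starter,
         "--net_spec", net_spec,
         "--model_output", checkpoint_prefix,
         "--train_listfile", f"{out_dir}/list.train",  # lists the .lstmf files
         "--max_iterations", str(max_iterations)],
        # 4. Combine the checkpoint and starter data into the final traineddata.
        ["lstmtraining", "--stop_training",
         "--continue_from", f"{checkpoint_prefix}_checkpoint",
         "--traineddata", starter,
         "--model_output", f"{out_dir}/{lang}.traineddata"],
    ]
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real data. Keeping the commands as data makes it easy to print and review them before running anything.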
Tesseract 3.x is based on traditional computer vision algorithms. In the past few years, Deep Learning based methods have surpassed traditional machine learning techniques by a huge margin in terms of accuracy in many areas of Computer Vision. Handwriting recognition is one of the prominent examples.
It appears that Tesseract uses an adaptive classifier by default. Check this out for a good read:
https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf
There appears to be an option called "Cube mode" where it will switch to using NNs for the learning system instead of the adaptive classifier (https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube). More info about adaptive classifiers:
http://www.cs.indiana.edu/~rawlins/website/adaptivity/information-helper.html
Also, related very closely is a Learning Classifier System:
http://en.wikipedia.org/wiki/Learning_classifier_system
Also, your understanding of "training" is very close. Training is how you teach a pattern recognition or learning system what responses it should give to certain sets of inputs. Then, when it encounters unknown data, it uses similarities to what it has seen to classify the new data. Machine learning is one of the coolest fields in existence, in my opinion (probably a biased opinion, but whatever!). Keep up the learning! You are the meta learner: learning how to teach a machine to learn! Cool stuff!
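To make the idea of "train on labeled examples, then classify new data by similarity" concrete, here is a toy nearest-centroid classifier in plain Python. It is not how Tesseract works internally; it is just a minimal sketch of supervised training, and the two features (say, width-to-height ratio and ink density of a glyph) and their values are made up for illustration.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, label)


def train(samples: Sequence[Sample]) -> Dict[str, List[float]]:
    """Learn one centroid (mean feature vector) per label."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(model: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Return the label whose centroid is closest to the new sample."""
    return min(model, key=lambda label: dist(model[label], features))


# Made-up training data: (width/height ratio, ink density) for two glyphs.
training_data = [
    ([0.20, 0.90], "I"),
    ([0.25, 0.85], "I"),
    ([0.90, 0.50], "O"),
    ([0.95, 0.55], "O"),
]
model = train(training_data)
print(classify(model, [0.30, 0.80]))  # an unseen, narrow and dense glyph
```

The "training" step only stores averages here, but the shape is the same as in bigger systems: labeled inputs go in, a model comes out, and the model is then used on data it has never seen.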
Yes. Starting with Tesseract 4.0, it provides a new LSTM-based OCR engine: https://tesseract-ocr.github.io/tessdoc/NeuralNetsInTesseract4.00