I am working on OCR of printed text. In particular, I am focusing on the preprocessing step to improve the results of the Tesseract engine. I have already obtained good results with adaptive thresholding, noise removal, text deskew, etc., but Tesseract still fails where other commercial products return decent results.
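For reference, this is roughly the kind of pipeline I am running (a simplified sketch with OpenCV and pytesseract; the file name, filter sizes and thresholds are placeholders, not my exact values):

import cv2
import numpy as np
import pytesseract

# Load the scan in grayscale
img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# 1. Noise removal (median filter keeps edges reasonably sharp)
img = cv2.medianBlur(img, 3)

# 2. Adaptive thresholding -> binary image, dark text on white background
binary = cv2.adaptiveThreshold(img, 255,
                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 15)

# 3. Deskew: estimate the dominant skew from the minimum-area rectangle
#    around the ink pixels (note: the angle convention of cv2.minAreaRect
#    changed between OpenCV versions, so this correction may need adjusting)
coords = np.column_stack(np.where(binary < 128)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
angle = -(90 + angle) if angle < -45 else -angle
h, w = binary.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
binary = cv2.warpAffine(binary, M, (w, h),
                        flags=cv2.INTER_CUBIC,
                        borderMode=cv2.BORDER_REPLICATE)

# 4. Hand the cleaned binary image to Tesseract
print(pytesseract.image_to_string(binary))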
I used the following test image, and here are the results obtained with Tesseract 3.04 compared to two commercial OCR APIs. All three services were given the same binary image, which contains some slightly blurred text.
Tesseract
Careers in Technology Consulting
Networking Lunch
21 m 2014, 11:00 - 14:30
Definingthecorporatellstmtegy, Wammmwdngdeal, creating
uniquebwinessisighnwilgbigdam-doesflismflxemmyouafioy?
Findoutmoreabanhowitfeektomkasatedlflogymbyjoiningour
for further mm please visit mAeloittexom/weers
ABBYY Fine Reader Online
Careers in Technology Consulting
Networking Lunch
21 November 2014,1140-14:30
Defining the corporate IT strategy, planning a multHnKon <Mar outsourcing deal, creating unique business insights using big data-doesthis sound Ifce something you enjoy?
Find out more about hour it feels to work as a technology consultant by joining our exclusive networking lunch,
For further information please visit wrwMuleloittexom/carcert
Online OCR
Careers in Technology Consulting Networking Lunch 21 November 2014, 11;00 —14:30
Defining the corporate IT strategy, planning a muiti-indlimi dollar outsourcing deal, creating unique business insights using big data—does this sound like something you enjoy?
Find out more about how it feels to work as a tedmology consultant by joining our exclusive networking lunch,
For further information' please visit wwwdeloitte,com/careers
Now I wonder whether the big gap between Tesseract and the other two products is due to a different engine (ABBYY certainly uses its own engine; I am not sure about OCR Web Service) or whether there are other preprocessing steps that can be done before running Tesseract. Do you have any suggestions?
Google does well on the scanned email and recognizes the text in the smartphone-captured document about as well as ABBYY. However, it is much better than Tesseract or ABBYY at recognizing handwriting, as the second result image shows: still far from perfect, but at least it gets some things right.
While Tesseract is known as one of the most accurate free OCR engines available today, it has numerous limitations that dramatically affect its performance, that is, its ability to correctly recognize characters in a scan or image.
Here is a suggestion for some "magic" OCR preprocessing. To explain the principle of the proposed idea, let's consider an excerpt from the provided text image on which all of the tested OCR engines failed:
and apply some "preprocessing wisdom" to it. First, the usual thresholding:
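In code, that first step could look roughly like this (a sketch using OpenCV with Otsu's method; the file names are only placeholders):

import cv2

# Binarize the excerpt; Otsu picks a global threshold automatically,
# giving black text (0) on a white background (255)
gray = cv2.imread("excerpt.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("excerpt_binary.png", binary)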
and then some "magic": shoot vertical scan lines through the word elements, detect "bars" that are at most 2 pixels high, cut them at their edges, and also trim the word element down to its bottom line:
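A crude way to implement this vertical-scan step could look like the following sketch (it assumes black text (0) on a white background (255); the 2-pixel limit is the one mentioned above, and the "trim down to the bottom line" part is left out):

import cv2

def cut_thin_horizontal_bars(binary, max_bar_height=2):
    """Walk down every column ("vertical scan line") of a binarized word
    image and erase foreground runs that are at most max_bar_height pixels
    tall. Such short runs are where the scan line crosses a thin horizontal
    "bar" that glues characters together."""
    ink = binary < 128
    out = binary.copy()
    h, w = binary.shape
    for x in range(w):
        y = 0
        while y < h:
            if ink[y, x]:
                start = y
                while y < h and ink[y, x]:
                    y += 1
                if y - start <= max_bar_height:
                    out[start:y, x] = 255   # cut the thin bar at this column
            else:
                y += 1
    return out

binary = cv2.imread("excerpt_binary.png", cv2.IMREAD_GRAYSCALE)
cv2.imwrite("excerpt_cut_v.png", cut_thin_horizontal_bars(binary))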
Now switch from shooting vertical lines through the word elements to horizontal ones, in order to detect very wide "bars" and cut them vertically in the middle of their width:
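The horizontal-scan step could then be sketched in the same style (the minimum bar width is a guess and would need tuning to the image resolution):

import cv2

def split_wide_bars(binary, min_bar_width=40):
    """Walk along every row ("horizontal scan line") of the binarized word
    image and, wherever a foreground run is at least min_bar_width pixels
    wide, erase one pixel in the middle of that run so the wide "bar" is
    split into two halves."""
    ink = binary < 128
    out = binary.copy()
    h, w = binary.shape
    for y in range(h):
        x = 0
        while x < w:
            if ink[y, x]:
                start = x
                while x < w and ink[y, x]:
                    x += 1
                if x - start >= min_bar_width:
                    out[y, (start + x) // 2] = 255   # cut the wide bar in the middle
            else:
                x += 1
    return out

binary = cv2.imread("excerpt_cut_v.png", cv2.IMREAD_GRAYSCALE)
cv2.imwrite("excerpt_cut_hv.png", split_wide_bars(binary))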
This should help any OCR engine provide better results on this particular image. I can imagine that some of the commercial OCR engines already use this kind of approach, which would explain why they provide better recognition than the engines tested here.
In this context, let me mention other free OCR engines available in the Ubuntu repositories (comparable with Tesseract). Testing them against each other, you may wonder even more why they produce different results; you can then look into their source code to find out :) and infer from this experience something about the commercial engines.
sudo apt-get install cuneiform gocr ocrad
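To run them against each other on the same input, something like this rough sketch could be used (the output file names and the ImageMagick conversions are my assumptions; gocr and ocrad want PNM input and cuneiform is happiest with BMP, so converting first is the safe bet):

import subprocess

# Convert the binarized PNG into formats the other engines accept
# (the `convert` tool comes from ImageMagick)
subprocess.run(["convert", "binary.png", "binary.pgm"], check=True)
subprocess.run(["convert", "binary.png", "binary.bmp"], check=True)

engines = {
    "tesseract": ["tesseract", "binary.png", "out_tesseract"],      # -> out_tesseract.txt
    "cuneiform": ["cuneiform", "-o", "out_cuneiform.txt", "binary.bmp"],
    "gocr":      ["gocr", "-o", "out_gocr.txt", "binary.pgm"],
    "ocrad":     ["ocrad", "-o", "out_ocrad.txt", "binary.pgm"],
}
for name, cmd in engines.items():
    subprocess.run(cmd, check=False)   # each engine writes its own text file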