Improve Tesseract OCR results with blurred text

I am working on OCR of printed text, focusing in particular on the preprocessing step to improve the results of the Tesseract engine. I have already obtained good results with adaptive thresholding, noise removal, text deskewing, and so on, but Tesseract still seems to fail where commercial products return decent results.

I used the following test image; here are the results obtained with Tesseract 3.04 compared to two commercial OCR APIs. All three services were given the same binary image, which contains some slightly blurred text.

Text image used to compare the 3 OCR products

Tesseract

Careers in Technology Consulting

Networking Lunch
21 m 2014, 11:00 - 14:30

Definingthecorporatellstmtegy, Wammmwdngdeal, creating
uniquebwinessisighnwilgbigdam-doesflismflxemmyouafioy?

Findoutmoreabanhowitfeektomkasatedlflogymbyjoiningour

for further mm please visit mAeloittexom/weers

ABBYY Fine Reader Online

Careers in Technology Consulting
Networking Lunch
21 November 2014,1140-14:30
Defining the corporate IT strategy, planning a multHnKon <Mar outsourcing deal, creating unique business insights using big data-doesthis sound Ifce something you enjoy?
Find out more about hour it feels to work as a technology consultant by joining our exclusive networking lunch,
For further information please visit wrwMuleloittexom/carcert

Online OCR

Careers in Technology Consulting Networking Lunch 21 November 2014, 11;00 —14:30 
Defining the corporate IT strategy, planning a muiti-indlimi dollar outsourcing deal, creating unique business insights using big data—does this sound like something you enjoy? 
Find out more about how it feels to work as a tedmology consultant by joining our exclusive networking lunch, 
For further information' please visit wwwdeloitte,com/careers 

Now I wonder whether the big gap between Tesseract and the other two products is due to a different engine (ABBYY certainly uses its own engine; I am not sure about OCR Web Service) or whether there are other preprocessing steps that can be done before running Tesseract. Do you have any suggestions?

Marco Ancona asked Dec 27 '14 21:12




1 Answer

Here is a suggestion for some "magic" OCR preprocessing. To explain the principle behind the idea, let's consider an excerpt from the provided text image on which all of the tested OCR engines failed:

original image

and apply some "preprocessing wisdom" to it. First, the usual thresholding:

thresholded image

and then some "magic": shoot vertical lines through the word elements, detect "bars" that are at most 2 pixels high, and cut them at their edges, while also cutting the word element down to its bottom line:

after extracting "i"s
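
One possible reading of the vertical scan, as a minimal Python sketch. It assumes the thresholded image is a list of rows with 1 for ink and 0 for background; the names `vertical_runs`, `thin_bar_columns` and `cut_columns` are made up for illustration and are not taken from any library. The sketch also simplifies the step: instead of cutting only at the edges of a bar, it blanks every column whose ink consists solely of runs at most 2 pixels high.

```python
def vertical_runs(img, x):
    """Return (top, bottom_exclusive) for each vertical ink run in column x."""
    runs, start = [], None
    for y in range(len(img)):
        if img[y][x] and start is None:
            start = y
        elif not img[y][x] and start is not None:
            runs.append((start, y))
            start = None
    if start is not None:
        runs.append((start, len(img)))
    return runs


def thin_bar_columns(img, max_height=2):
    """Columns that contain ink, but only in runs of at most max_height pixels."""
    cols = []
    for x in range(len(img[0])):
        runs = vertical_runs(img, x)
        if runs and all(bottom - top <= max_height for top, bottom in runs):
            cols.append(x)
    return cols


def cut_columns(img, cols):
    """Return a copy of img with the given columns set to background."""
    out = [row[:] for row in img]
    for x in cols:
        for row in out:
            row[x] = 0
    return out


# Two stems joined at the bottom by a 1-pixel-high bridge (assumed toy data).
word = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
bridge = thin_bar_columns(word)   # columns 1, 2 and 3 hold only the bridge
separated = cut_columns(word, bridge)
```

Whether a real implementation should blank the whole bar or only its edge columns depends on the font and the amount of blur, so treat `max_height=2` and the blanking rule as starting points.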

Now switch from vertical to horizontal lines shot through the word elements in this image, in order to detect very wide "bars" and cut them vertically in the middle of their width:

after splitting grown-together characters
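
The horizontal pass can be sketched the same way. This is only an illustration under assumptions: the image is again a list of rows with 1 for ink, the helper names `horizontal_runs` and `split_wide_bars` are hypothetical, and `min_width` stands in for "very wide", which has to be tuned to the expected character width.

```python
def horizontal_runs(row):
    """Return (left, right_exclusive) for each horizontal ink run in a row."""
    runs, start = [], None
    for x, value in enumerate(row):
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs


def split_wide_bars(img, min_width):
    """Blank one full-height column through the middle of every wide run.

    Returns the modified copy and the sorted list of columns that were cut.
    """
    cut_cols = set()
    for row in img:
        for left, right in horizontal_runs(row):
            if right - left >= min_width:
                # Middle column of the run covering left .. right - 1.
                cut_cols.add((left + right - 1) // 2)
    out = [row[:] for row in img]
    for x in cut_cols:
        for row in out:
            row[x] = 0
    return out, sorted(cut_cols)


# Two glyphs grown together through a 7-pixel-wide crossbar (assumed toy data).
merged = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
]
split, cuts = split_wide_bars(merged, min_width=6)
```

Cutting the full column keeps the sketch short; a more careful version would cut only the rows that belong to the wide run.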

This should help any OCR engine provide better results on this particular image. I can imagine that some commercial OCR engines already use this approach and are therefore able to provide better recognition than the ones tested here.

In this context, let me mention some other free OCR engines available in the Ubuntu repositories (comparable to Tesseract). Testing them against each other, you may wonder even more why they give different results; you can then look into their source code to find out :) and infer from that experience something about the commercial ones.

sudo apt-get install cuneiform gocr ocrad
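
Once several engines have produced text for the same image, a rough way to test them against each other is to score how similar their outputs are. The sketch below uses Python's standard `difflib`; the function name `pairwise_similarity` and the dictionary keys are made up, and the sample strings are the last line of each result quoted in the question. It does not run any OCR engine itself.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(outputs):
    """Map each pair of engine names to a 0..1 similarity of their text."""
    scores = {}
    for a, b in combinations(sorted(outputs), 2):
        scores[(a, b)] = SequenceMatcher(None, outputs[a], outputs[b]).ratio()
    return scores


outputs = {
    "tesseract": "for further mm please visit mAeloittexom/weers",
    "abbyy": "For further information please visit wrwMuleloittexom/carcert",
    "online_ocr": "For further information' please visit wwwdeloitte,com/careers",
}

scores = pairwise_similarity(outputs)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

A low score between two engines only says that they disagree, not which one is right; for that you still need a hand-typed reference text.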
Claudio answered Oct 06 '22 22:10