Increase Accuracy of text recognition through pytesseract & PIL

I am trying to extract text from an image. Because the quality and size of the image are poor, the results are inaccurate. I tried a few enhancements and other things with PIL, but they only made the image quality worse.

Can someone suggest image enhancements that would give better results? A few example images:

two

three

sprksh asked Apr 13 '17 01:04



1 Answer

In the provided example image the text looks quite good visually, so the question is why OCR gives inaccurate results.

To illustrate the conclusions given later in this answer, let's run the given image through Tesseract. Below is the result of Tesseract OCR:

"fhpgearedmomrs©gmachom"

Now let's enlarge the image to four times its size and apply thresholding. I did the resizing and thresholding manually in GIMP, but with an appropriate resizing method and threshold value it can certainly be automated in PIL, so that after the enhancement you get an image similar to the enhanced one I got.

The improved image run through Tesseract OCR gives the following text:

"fhpgearedmotors©gmail.com"

This demonstrates that enlarging an image can help to achieve 100% accuracy on the provided text-image example.
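
As a rough sketch of how the manual GIMP steps could be automated with PIL (Pillow): the function below converts the image to grayscale, enlarges it, and applies a fixed threshold. The function name `enlarge_and_threshold`, the LANCZOS resampling filter, and the threshold of 128 are my assumptions, not values taken from the GIMP session; the threshold in particular will likely need tuning per image.

```python
from PIL import Image


def enlarge_and_threshold(img, scale=4, threshold=128):
    """Upscale an image and binarize it (assumes dark text on a light background)."""
    # Work on a single grayscale channel so one threshold value is enough.
    gray = img.convert("L")
    # Enlarge by an integer factor; LANCZOS is an assumed choice of filter.
    new_size = (gray.width * scale, gray.height * scale)
    big = gray.resize(new_size, Image.LANCZOS)
    # Pixels brighter than the threshold become white, all others black.
    return big.point(lambda p: 255 if p > threshold else 0)
```

The returned image can then be passed to `pytesseract.image_to_string(...)` in place of the original one.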

It may seem odd that enlarging an image improves OCR accuracy, but OCR was developed to convert scans of printed media to text and by design expects images of the text at about 300 dpi. This explains why some OCR programs do not resize the text themselves to improve their results and do poorly on small fonts: they expect a higher dpi resolution, which can be achieved by enlarging the image.

Here is an excerpt from the Tesseract FAQ on github.com supporting the statement above:

[There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".]
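
The x-height check from the FAQ can be turned into a quick estimate of how much to enlarge an image. This is only a sketch: the helper name `scale_for_x_height` is hypothetical, the x-height has to be measured by hand in pixels, and the 20-pixel target is taken from the "10pt x 300dpi" figure in the quote above, which the FAQ itself says varies from font to font.

```python
import math


def scale_for_x_height(measured_px, target_px=20):
    """Smallest integer factor that brings the x-height up to target_px."""
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    # Never shrink the image: a factor below 1 is clamped to 1.
    return max(1, math.ceil(target_px / measured_px))
```

For example, a measured x-height of 5 pixels gives a factor of 4, and an image whose x-height is already 20 pixels or more is left at its original size.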

Claudio answered Oct 06 '22 03:10