Low success rate with pytesser? Is this an issue of noise, or is there something else that needs to be done?

I'm trying to detect a few uppercase characters in a screenshot. I convert it to black and white with PIL and then, using the code example from the PyTesser page, run tesser.exe on the image:

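# PyTesser's wildcard import also brings in PIL's Image module
from pytesser import *
image = Image.open('fnord.tif')
# image_to_string hands the image to the OCR engine and returns the recognised text
print(image_to_string(image))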

I'm using this image: http://i.imgur.com/so419.png

But it doesn't recognize it as an E, or as anything else for that matter. I think it's a clean enough capture. The noise at the top isn't throwing it off, is it?

Is there something I'm missing?

Zack asked Aug 12 '12 16:08




1 Answer

If you are concerned that the noise is the issue, open the image in MS Paint or something similar, remove the noise by hand, and then run the cleaned image through the OCR. This is the best way to learn how the OCR engine works and what does and doesn't confuse it. Every OCR engine works differently.

In this case, the small bits of noise could also be confusing the character zoning process. You should check the bounding box values returned by the OCR engine to see whether it is even looking in the correct location for your word or character.
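
One way to check the bounding boxes, sketched under some assumptions: Tesseract can write a box file when run with its makebox config, with one line per character in the form "char left bottom right top page" and the origin at the bottom-left corner of the image. The parse_boxes and box_size helpers and the sample text below are hypothetical, made up for illustration; they are not part of PyTesser.

```python
from collections import namedtuple

# One entry per recognised character; coordinates are in pixels,
# measured from the bottom-left corner of the image.
Box = namedtuple("Box", "char left bottom right top page")


def parse_boxes(box_text):
    """Parse Tesseract box-file text into a list of Box tuples."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        try:
            left, bottom, right, top, page = (int(p) for p in parts[1:])
        except ValueError:
            continue  # skip lines whose coordinates are not integers
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def box_size(box):
    """Return (width, height) of a box in pixels."""
    return box.right - box.left, box.top - box.bottom


# Hypothetical box-file content: a large 'E' plus a tiny speck read as '~'.
sample = "E 12 8 52 60 0\n~ 30 70 32 72 0\n"
for box in parse_boxes(sample):
    width, height = box_size(box)
    print(box.char, width, height)
```

If the box reported for your 'E' is only a few pixels across, or sits up in the band of noise, the zoning step is probably looking in the wrong place.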

Some OCR engines have options to remove noise from an image during the OCR process. This is often called despeckle or noise removal. It would be possible to remove the noise using Leptonica (http://www.leptonica.org), which is now part of the latest Tesseract releases.
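
If you would rather clean the image yourself before OCR, a rough despeckle is not hard to write. The sketch below is not Leptonica's algorithm; it is a simple stand-in that erases connected blobs of ink smaller than a threshold. The grid representation (lists of 0/1, with 1 meaning ink), the despeckle name, and the min_size default are all assumptions for illustration.

```python
def despeckle(grid, min_size=4):
    """Return a copy of a binary grid with small ink blobs erased.

    grid: list of rows (lists), where 1 = ink and 0 = background.
    A blob is an 8-connected group of ink pixels; any blob with fewer
    than min_size pixels is treated as noise and set to 0.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    result = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill from this pixel to collect one connected blob.
            blob = []
            stack = [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            # Erase the blob if it is too small to be a real character.
            if len(blob) < min_size:
                for by, bx in blob:
                    result[by][bx] = 0
    return result


# A lone speck in the top-left corner and a 2x2 block of "real" ink.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
for row in despeckle(noisy):
    print(row)
```

You would have to build the grid from your PIL image first (for example with getpixel), and pick min_size well below the pixel count of the smallest real character so that only specks are removed.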

Screen fonts present a big challenge to OCR engines because the DPI is often very low. In the case of your 'E' there should be more than enough pixels to be recognised. The heavy stroke weight could be confusing the engine.
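
To test whether the heavy stroke weight is the problem, one experiment is to thin the strokes by a pixel and run the OCR again. A minimal sketch using a plain 4-neighbour erosion on a binary grid (1 = ink); the erode helper and the grid format are assumptions for illustration, not something Tesseract or PyTesser provides.

```python
def erode(grid):
    """Thin ink strokes by one pixel on each side.

    grid: list of rows, where 1 = ink and 0 = background.
    An ink pixel survives only if its up, down, left and right
    neighbours are all ink; pixels outside the grid count as background.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0

    def ink(y, x):
        return 0 <= y < height and 0 <= x < width and grid[y][x] == 1

    return [
        [
            1 if (ink(y, x) and ink(y - 1, x) and ink(y + 1, x)
                  and ink(y, x - 1) and ink(y, x + 1)) else 0
            for x in range(width)
        ]
        for y in range(height)
    ]


# A 3-pixel-thick stroke shrinks to its 1-pixel core.
thick = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
for row in erode(thick):
    print(row)
```

Strokes only one or two pixels wide disappear entirely under this erosion, so it is only worth trying on bold glyphs like the one in the screenshot.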

Commercial engines are usually more accurate than Tesseract, but they come with expensive licence fees.

Andrew Cash answered Oct 10 '22 07:10
