
Android: How to improve the accuracy of numbers read from an image by Tesseract OCR?

I made a simple Android app that reads an image and returns the numbers in it as text. The problem is that the accuracy is only about 60%, and some unwanted noise also shows up in the output. I realize the accuracy can't be 100%, but I believe there must be a way to improve it. Since I'm a beginner, I'm finding it difficult. I've searched Google but couldn't find any solid information.

I want to read the numbers 596, 00, and 012345 from an oriental lucky ticket like the one in the image below.

enter image description here

Jennifer asked Jan 28 '15 06:01



1 Answer

Tesseract-ocr works best on images of characters which meet the following criteria:

  • The input image should have at least 300 dpi

  • The input image should be black and white

  • There should be minimal noise in the input image (i.e. the text should be clearly distinguishable from the background)

  • Text lines should be straight

  • The image should be centered around the text to be detected

(See the tesseract-ocr wiki for further details)

For a given input image, tesseract will try to pre-process and clean the image to meet these criteria, but to maximise your detection accuracy, it is best to do the pre-processing yourself.
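As a rough illustration of the "black and white" criterion, here is a minimal sketch of binarizing a grayscale image with Otsu's method before handing it to tesseract. It is written in plain Python for readability and is not taken from the tesseract documentation; the function names `otsu_threshold` and `binarize` and the tiny sample image are made up for this example, and on Android you would apply the same idea to the pixels of your bitmap.

```python
def otsu_threshold(gray):
    """Return an Otsu threshold for a 2D list of 0-255 grayscale values."""
    # Build the intensity histogram.
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their intensities
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break     # every pixel is in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map pixels at or below the Otsu threshold to 0, all others to 255."""
    t = otsu_threshold(gray)
    return [[0 if value <= t else 255 for value in row] for row in gray]


# Hypothetical 2x3 image: dark text pixels (10, 12) on a light background.
sample = [[10, 200, 210],
          [12, 205, 10]]
binary = binarize(sample)
```

A global threshold like this only helps when the text and the background differ clearly in brightness; for a patterned background such as the ticket in the question, it is a first step rather than a complete fix.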

Based on the input image you provided, the main problem is that there is too much background noise. To remove the background noise from the text in the image, I have found that applying the Stroke Width Transform (SWT) algorithm with a threshold value to remove noise gives promising results. A fast implementation of SWT with many configurable parameters is provided in the libCCV library. How well it cleans the image depends on a number of factors including image size, uniformity of stroke width and other input parameters to the algorithm. A list of the configurable parameters is provided here.

You then pass the output of SWT to tesseract to obtain the text values of characters in the image.

If the image passed to tesseract still contains some noise, it may return some false detections such as punctuation characters. Given that the image you are processing is likely to only contain letters and numbers a-z A-Z 0-9, you can simply apply a regex to the output to remove any final false detections.
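The regex clean-up could look something like the sketch below, shown in Python for brevity (the same patterns work with Java's `java.util.regex` on Android). The helper names `keep_alphanumeric` and `extract_numbers` and the sample OCR string are invented for illustration; the second helper assumes, as in the question, that only the digit groups matter.

```python
import re


def keep_alphanumeric(ocr_text):
    """Drop everything except a-z, A-Z, 0-9 and whitespace, then tidy spacing."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", ocr_text)
    return " ".join(cleaned.split())


def extract_numbers(ocr_text):
    """Return each run of consecutive digits as a separate string."""
    return re.findall(r"\d+", ocr_text)


# Made-up OCR output with stray punctuation from background noise.
raw = "~596. '00' ;012345:"
print(keep_alphanumeric(raw))
print(extract_numbers(raw))
```

Keeping the results as strings preserves leading zeros such as those in 00 and 012345. Note that noise detected in the middle of a number (for example "5,96") would split it into separate groups, so this filter complements the image clean-up rather than replacing it.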

sparkplug answered Oct 13 '22 21:10
