Improve Tesseract detection quality

I am trying to extract alphanumeric characters (a-z0-9) that do not form meaningful words from an image taken with a consumer camera (including mobile phones). The characters have the same size and font type and are not formatted. The actual processing is done under Windows.

The following image shows the raw input: [original image]

After perspective processing I apply the following with OpenCV:

  • Convert from RGB to gray
  • Apply cv::medianBlur to remove noise
  • Convert the image to binary using adaptive thresholding cv::adaptiveThreshold
  • I know the number of rows and columns of the grid, so I simply extract each grid cell using this information (a sketch of these steps follows this list).
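
A minimal sketch of this preprocessing pipeline in C++/OpenCV (the function name extractCells, the median kernel size and the adaptive-threshold parameters are illustrative assumptions, not values from the question):

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Sketch of the preprocessing steps listed above. "corrected" is assumed to be
    // the perspective-corrected BGR image; gridRows/gridCols are known in advance.
    std::vector<cv::Mat> extractCells(const cv::Mat& corrected, int gridRows, int gridCols)
    {
        cv::Mat gray, denoised, binary;
        cv::cvtColor(corrected, gray, cv::COLOR_BGR2GRAY);   // color -> gray
        cv::medianBlur(gray, denoised, 3);                   // remove salt-and-pepper noise
        cv::adaptiveThreshold(denoised, binary, 255,
                              cv::ADAPTIVE_THRESH_MEAN_C,
                              cv::THRESH_BINARY, 31, 10);    // local threshold (illustrative parameters)

        // Cut the binarized image into equally sized grid cells.
        std::vector<cv::Mat> cells;
        const int cellW = binary.cols / gridCols;
        const int cellH = binary.rows / gridRows;
        for (int r = 0; r < gridRows; ++r)
            for (int c = 0; c < gridCols; ++c)
                cells.push_back(binary(cv::Rect(c * cellW, r * cellH, cellW, cellH)).clone());
        return cells;
    }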

After all these steps I get images which look similar to these:

[Three example cell images]

Then I run tesseract (latest SVN version with latest training data) on each extracted cell image individually (I tried different -psm and -l values):

tesseract.exe -l eng -psm 11 sample.png outtext

The results produced by tesseract are not very good:

  • Most characters are not recognized.
  • The grid lines are sometimes interpreted as "l" or "i" characters.

I already experimented with morphological operations (open, close, erode, dilate) and replaced adaptive thresholding with Otsu thresholding (THRESH_OTSU), but the results got worse.

What else could I try to improve the recognition quality? Or is there an even better method to extract the characters than tesseract, for instance template matching?

Edit (21-12-2014): I tested simple template matching (using normalized cross correlation and LMS), but the results were even worse. However, I have made a huge step forward by extracting each character using findContours and then running tesseract on one character at a time with the -psm 10 option, which interprets each input image as a single character. Additionally, I remove non-alphanumeric characters in a post-processing step. The first results are encouraging, with detection rates of 90% and better. The main remaining problem is misdetections of the "9", "g" and "q" characters.
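
A rough sketch of this per-character extraction in C++/OpenCV (the function name extractCharacters, the minimum-area filter and the padding are illustrative assumptions; the binarized cell is assumed to contain white glyphs on a black background, so invert it first if the polarity is the other way round):

    #include <opencv2/opencv.hpp>
    #include <string>
    #include <vector>

    // Extract single-character crops from a binarized cell image (white glyphs
    // on black). Each crop is written to disk and can then be recognized with
    //   tesseract.exe -l eng -psm 10 char_0.png out
    void extractCharacters(const cv::Mat& binaryCell)
    {
        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(binaryCell.clone(), contours,
                         cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

        int index = 0;
        for (const auto& contour : contours)
        {
            cv::Rect box = cv::boundingRect(contour);
            if (box.area() < 30)                 // skip specks of noise (illustrative threshold)
                continue;

            cv::Mat crop = binaryCell(box).clone();
            cv::copyMakeBorder(crop, crop, 10, 10, 10, 10,
                               cv::BORDER_CONSTANT, cv::Scalar(0));  // add a margin around the glyph
            cv::imwrite("char_" + std::to_string(index++) + ".png", crop);
        }
    }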

Regards,

asked Dec 21 '14 by Hyndrix


2 Answers

As I say here, you can tell tesseract to pay attention to characters that look almost the same. Also, some of tesseract's default behaviour does not help you in your example: for instance, a "Pocahonta5S" will most of the time become "PocahontaSS", because the digit sits inside a word made of letters.

Concerning pre-processing, you had better use a sharpening filter. Don't forget that tesseract always applies an Otsu threshold before reading anything. If you want good results, sharpening plus adaptive thresholding, combined with some other filters, is a good idea.
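
A minimal sketch of that sharpening + adaptive thresholding combination in C++/OpenCV (not from the original answer; the function name sharpenAndBinarize, the Gaussian sigma, the unsharp-mask weights and the threshold parameters are illustrative values that need tuning), assuming a grayscale input:

    #include <opencv2/opencv.hpp>

    // Unsharp masking followed by adaptive thresholding.
    cv::Mat sharpenAndBinarize(const cv::Mat& gray)
    {
        cv::Mat blurred, sharpened, binary;
        cv::GaussianBlur(gray, blurred, cv::Size(0, 0), 3.0);       // low-pass copy
        cv::addWeighted(gray, 1.5, blurred, -0.5, 0.0, sharpened);  // unsharp mask
        cv::adaptiveThreshold(sharpened, binary, 255,
                              cv::ADAPTIVE_THRESH_GAUSSIAN_C,
                              cv::THRESH_BINARY, 31, 10);
        return binary;
    }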

answered Nov 15 '22 by Alto


I recommend using OpenCV in combination with tesseract.

The problem with your input images for tesseract is the non-character regions they contain.

An approach I would try myself

To get rid of these, I would use the OpenCV findContours function to retrieve all contours in your binary image. Afterwards, define some criteria to eliminate the non-character regions: for example, only keep regions that lie inside the image and do not touch the border, or only keep regions with a specific area or a specific height-to-width ratio. Find features that let you distinguish between character and non-character contours. Then eliminate the non-character regions and hand the cleaned images over to tesseract (see the sketch below).
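
A hedged sketch of such a filter in C++/OpenCV (the function name characterRegions and all thresholds for area and height-to-width ratio are illustrative and need tuning for the actual images; the characters are assumed to be the non-zero/white pixels of the binary image):

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Keep only bounding boxes that look like single characters:
    // fully inside the image, plausible area, plausible height-to-width ratio.
    std::vector<cv::Rect> characterRegions(const cv::Mat& binary)
    {
        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(binary.clone(), contours,
                         cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

        std::vector<cv::Rect> regions;
        for (const auto& contour : contours)
        {
            cv::Rect box = cv::boundingRect(contour);

            bool touchesBorder = box.x == 0 || box.y == 0 ||
                                 box.x + box.width  >= binary.cols ||
                                 box.y + box.height >= binary.rows;
            double ratio = static_cast<double>(box.height) / box.width;
            bool plausibleSize  = box.area() > 50 &&
                                  box.area() < static_cast<int>(binary.total()) / 4;
            bool plausibleShape = ratio > 0.5 && ratio < 5.0;

            if (!touchesBorder && plausibleSize && plausibleShape)
                regions.push_back(box);      // likely a character region
        }
        return regions;
    }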

Just as an idea for testing this approach in general:

Eliminate the non-character regions manually (GIMP, Paint, ...) and give the image to tesseract. If the result meets your expectations, you can try to eliminate the non-character regions with the method proposed above.

answered Nov 15 '22 by Mr.Sheep