
Tesseract-OCR recognizes text better when the image is preprocessed manually in Gimp than when it is preprocessed by my Python code

I am trying to write Python code that automates the image preprocessing I currently do by hand before running Tesseract-OCR.

Manual process:
To recognize the text in a single image manually, I preprocess the image in Gimp and save it as a TIF image. Then I feed it to Tesseract-OCR, which recognizes it correctly.

To preprocess the image using Gimp I do -

  1. Change mode to RGB / Grayscale
    Menu -- Image -- Mode -- RGB
  2. Thresholding
    Menu -- Tools -- Color Tools -- Threshold -- Auto
  3. Change mode to Indexed
    Menu -- Image -- Mode -- Indexed
  4. Resize / Scale to Width > 300px
    Menu -- Image -- Scale image -- Width=300
  5. Save as Tif

Then I feed it to tesseract -

$ tesseract captcha.tif output -psm 6

And I get an accurate result all the time.

Python Code:
I have tried to replicate the above procedure using OpenCV and Tesseract -

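import cv2
from PIL import Image
from pytesseract import image_to_string

def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'):
    # cv2.IMREAD_GRAYSCALE replaces the older cv2.CV_LOAD_IMAGE_GRAYSCALE constant
    im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE)
    # With THRESH_OTSU the 128 is ignored; Otsu's method picks the threshold and returns it
    (thresh, im_bw) = cv2.threshold(im_gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Re-apply the Otsu threshold; replace thresh here to try a hand-picked value
    im_bw = cv2.threshold(im_gray, thresh, 255, cv2.THRESH_BINARY)[1]
    cv2.imwrite(binary_image_path, im_bw)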

    return binary_image_path

def preprocess_image_using_opencv(captcha_path):
    bin_image_path = binarize_image_using_opencv(captcha_path)

    im_bin = Image.open(bin_image_path)
    basewidth = 300  # in pixels
    wpercent = (basewidth/float(im_bin.size[0]))
    hsize = int((float(im_bin.size[1])*float(wpercent)))
    big = im_bin.resize((basewidth, hsize), Image.NEAREST)

    # tesseract-ocr only works with TIF so save the bigger image in that format
    tif_file = "input-NEAREST.tif"
    big.save(tif_file)

    return tif_file

def get_captcha_text_from_captcha_image(captcha_path):

    # Preprocess the image before OCR
    tif_file = preprocess_image_using_opencv(captcha_path)

    # Perform OCR (Optical Character Recognition) using the tesseract-ocr library
    image = Image.open(tif_file)
    ocr_text = image_to_string(image, config="-psm 6")
    alphanumeric_text = ''.join(e for e in ocr_text)

    return alphanumeric_text    

But I am not getting the same accuracy. What did I miss?

Update 1:

  1. Original image
    [image]
  2. TIF image created using Gimp
    [image]
  3. TIF image created by my Python code
    [image]

Update 2:

This code is available at https://github.com/hussaintamboli/python-image-to-text

Asked by Hussain on Sep 09 '15 at 07:09


1 Answer

If the output deviates only slightly from what you expect (e.g. stray characters such as ' or ", as your comments suggest), try limiting character recognition to the character set you expect (e.g. alphanumeric).
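
A minimal sketch of that idea, assuming the captcha contains only ASCII letters and digits. The helper names (build_tesseract_config, keep_allowed, ocr_captcha) are hypothetical and not part of the question's code. tessedit_char_whitelist is a Tesseract config variable, but some Tesseract versions may not honor it, so the sketch also filters the returned text in Python.

```python
import string

# Assumption: the captcha uses only ASCII letters and digits.
ALLOWED_CHARS = string.ascii_letters + string.digits


def build_tesseract_config(allowed=ALLOWED_CHARS, psm=6):
    """Build a config string that limits recognition to `allowed`."""
    # "-psm" is the spelling used in the question (Tesseract 3.x);
    # newer releases expect "--psm" instead.
    return "-psm {} -c tessedit_char_whitelist={}".format(psm, allowed)


def keep_allowed(text, allowed=ALLOWED_CHARS):
    """Drop every character of `text` that is not in `allowed`."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set)


def ocr_captcha(tif_file):
    """Run OCR on `tif_file`, restricted to the allowed characters."""
    # Imported here so the two helpers above work without Tesseract installed.
    from PIL import Image
    from pytesseract import image_to_string

    raw = image_to_string(Image.open(tif_file), config=build_tesseract_config())
    # Filter again in Python in case the whitelist was not applied.
    return keep_allowed(raw)
```

For example, keep_allowed("aB3'\"x\n") returns "aB3x", so stray quotes and the trailing newline are removed even when the whitelist has no effect.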

Answered by Aurelian on Nov 01 '22 at 18:11