Tesseract-OCR recognizes text better when the image is preprocessed manually in GIMP than with my Python code

I am trying to write Python code that automates my manual image preprocessing and text recognition with Tesseract-OCR.

Manual process:
To recognize the text in a single image manually, I preprocess the image in GIMP and save it as a TIF image. Then I feed it to Tesseract-OCR, which recognizes it correctly.

To preprocess the image in GIMP, I do the following:

1. Change mode to RGB / Grayscale
   Menu -- Image -- Mode -- RGB
2. Thresholding
   Menu -- Tools -- Color Tools -- Threshold -- Auto
3. Change mode to Indexed
   Menu -- Image -- Mode -- Indexed
4. Resize / Scale to Width > 300px
   Menu -- Image -- Scale image -- Width=300
5. Save as Tif
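
For reference, the GIMP steps above can be roughly sketched with Pillow alone. This is a minimal sketch under stated assumptions, not a reproduction of what GIMP does: the fixed threshold of 128 is only a stand-in for GIMP's "Auto" threshold, and the names `scaled_size` and `preprocess_like_gimp` are hypothetical helpers introduced here for illustration. It assumes Python 3 and that Pillow is installed.

```python
def scaled_size(width, height, target_width=300):
    """Return (target_width, new_height), preserving the aspect ratio."""
    new_height = max(1, int(round(height * target_width / width)))
    return target_width, new_height


def preprocess_like_gimp(src_path, tif_path, threshold=128, target_width=300):
    """Grayscale -> threshold -> scale -> save as TIFF (hypothetical helper)."""
    # Pillow is imported lazily so scaled_size stays usable without it.
    from PIL import Image

    # Step 1: convert to 8-bit grayscale.
    gray = Image.open(src_path).convert("L")
    # Step 2: fixed threshold (assumption: stand-in for GIMP's Auto threshold).
    binary = gray.point(lambda p: 255 if p >= threshold else 0)
    # Step 4: scale to the target width, keeping the aspect ratio.
    new_size = scaled_size(*binary.size, target_width=target_width)
    big = binary.resize(new_size, Image.NEAREST)
    # Step 5: write the result as TIFF.
    big.save(tif_path, format="TIFF")
    return tif_path
```

Calling `preprocess_like_gimp("captcha.png", "captcha.tif")` (file names are placeholders) would produce a TIF that can be passed to the `tesseract` command. The image stays in memory between steps, so no intermediate file is written.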

Then I feed it to tesseract:

$ tesseract captcha.tif output -psm 6

And I get an accurate result all the time.

Python Code:
I have tried to replicate the above procedure using OpenCV and Tesseract:

import cv2
from PIL import Image
# Assumption: image_to_string is the pytesseract wrapper around the tesseract CLI
from pytesseract import image_to_string


def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'):
    # Load as single-channel grayscale
    # (cv2.CV_LOAD_IMAGE_GRAYSCALE is the old name, removed in OpenCV 3)
    im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE)
    # With THRESH_OTSU the 128 is ignored; Otsu's method picks the threshold and returns it
    (thresh, im_bw) = cv2.threshold(im_gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Re-apply the Otsu threshold explicitly; replace thresh here to tune it by hand
    im_bw = cv2.threshold(im_gray, thresh, 255, cv2.THRESH_BINARY)[1]
    cv2.imwrite(binary_image_path, im_bw)

    return binary_image_path


def preprocess_image_using_opencv(captcha_path):
    bin_image_path = binarize_image_using_opencv(captcha_path)

    # Scale the binarized image to a width of 300px, keeping the aspect ratio
    im_bin = Image.open(bin_image_path)
    basewidth = 300  # in pixels
    wpercent = basewidth / float(im_bin.size[0])
    hsize = int(float(im_bin.size[1]) * wpercent)
    big = im_bin.resize((basewidth, hsize), Image.NEAREST)

    # Save the bigger image as TIF, the format used in the manual GIMP workflow
    tif_file = "input-NEAREST.tif"
    big.save(tif_file)

    return tif_file


def get_captcha_text_from_captcha_image(captcha_path):
    # Preprocess the image before OCR
    tif_file = preprocess_image_using_opencv(captcha_path)

    # Perform OCR (Optical Character Recognition) using the tesseract-ocr library
    image = Image.open(tif_file)
    ocr_text = image_to_string(image, config="-psm 6")
    # Note: as written, this join keeps every character and filters nothing
    alphanumeric_text = ''.join(e for e in ocr_text)

    return alphanumeric_text

But I am not getting the same accuracy. What did I miss?

Update 1:

Original Image
Tif Image created using Gimp
Tif Image created by my python code

Update 2:

This code is available at https://github.com/hussaintamboli/python-image-to-text

asked Sep 09 '15 07:09

Hussain

1 Answer

If the output deviates only minimally from what you expect (e.g. extra ', " and similar characters, as mentioned in your comments), try limiting character recognition to the character set you expect (e.g. alphanumeric).
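
One possible way to do this, sketched under assumptions: Tesseract has a `tessedit_char_whitelist` config variable that can be set with `-c`, and the OCR result can additionally be filtered in Python. The helper names below (`build_tesseract_config`, `keep_allowed`, `ocr_alphanumeric`) are hypothetical, and the sketch assumes `pytesseract` and Pillow are installed. The `-psm` spelling matches the question; newer Tesseract releases spell it `--psm`, and some builds may ignore the whitelist, which is why the Python filter is applied as well.

```python
import string

# Characters the captcha is assumed to contain (assumption: letters and digits only).
ALPHANUMERIC = string.ascii_letters + string.digits


def build_tesseract_config(psm=6, whitelist=ALPHANUMERIC, psm_flag="-psm"):
    """Build a config string that restricts recognition to `whitelist`."""
    return "{} {} -c tessedit_char_whitelist={}".format(psm_flag, psm, whitelist)


def keep_allowed(text, allowed=ALPHANUMERIC):
    """Drop every character that is not in `allowed` (quotes, commas, newlines...)."""
    return "".join(ch for ch in text if ch in allowed)


def ocr_alphanumeric(image_path):
    """Run OCR with the whitelist, then filter the result again as a safety net."""
    # Imported lazily so the two helpers above work without these packages.
    from PIL import Image
    import pytesseract

    raw = pytesseract.image_to_string(
        Image.open(image_path), config=build_tesseract_config()
    )
    return keep_allowed(raw)
```

The whitelist narrows what Tesseract may output, while `keep_allowed` removes any stray characters that still get through.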

answered Nov 01 '22 18:11

Aurelian

Tags: python, image, opencv, tesseract, python-tesseract