
Why does pytesseract fail to recognise digits from image with darker background?

I have this Python code, which I use to convert text in a picture to a string. It works for certain images with large characters, but not for the one I'm trying right now, which contains only digits.

This is the picture:

[image: Digits]

This is my code:

import pytesseract
from PIL import Image

img = Image.open('img.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
result = pytesseract.image_to_string(img)
print(result)

Why does it fail to recognise this specific image, and how can I solve this problem?

alioua walid asked May 05 '19 17:05



1 Answer

I have two suggestions.

First, and this is by far the most important: in OCR, preprocessing images is key to obtaining good results. In your case I suggest binarization. Your image already looks very clean, so it may not strictly be needed, but if recognition fails it is the first thing to try:

import cv2
from PIL import Image

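# cv2.imread returns a 3-channel BGR array by default, even for a grayscale file,
# so convert to a single channel before thresholding.
img = cv2.imread('gradient.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
threshold = 180  # to be determined for your image
_, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
pil_img = Image.fromarray(img_binarized)

And then try the OCR again with the binarized image.

The grayscale conversion is needed because cv2.imread loads images as 3-channel BGR by default. Alternatively, pass cv2.IMREAD_GRAYSCALE as the second argument to cv2.imread and drop the cvtColor line.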

This is simple (global) thresholding. Adaptive thresholding also exists, but it tends to produce noisier output and should not add anything in your case.

Binarized images will be much easier for Tesseract to handle. Tesseract already binarizes internally (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality), but that step can go wrong, and doing your own preprocessing is very often useful.

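You can check whether the threshold value is right by looking at the images:

import matplotlib.pyplot as plt
# Show the original and the binarized image side by side
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(img, cmap='gray')
ax2.imshow(img_binarized, cmap='gray')
plt.show()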

Second, if the above still doesn't work, I suggest you try out tesserocr. I know this doesn't answer "why doesn't pytesseract work here", but tesserocr is a maintained Python wrapper for Tesseract.

You could try:

import tesserocr
text_from_ocr = tesserocr.image_to_text(pil_img)

Here is the documentation for tesserocr on PyPI: https://pypi.org/project/tesserocr/

And for OpenCV: https://pypi.org/project/opencv-python/

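As a side note, Tesseract 3.05 and older handle white digits on a black background without problems. For 4.x, however, the ImproveQuality page linked above recommends dark text on a light background, so if results are still poor, try inverting the image.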

Ashargin answered Sep 21 '22 08:09