
Empty string with Tesseract

I'm trying to read several cropped images taken from one large file. Most of them are read correctly, but for some of them tesseract returns an empty string.

[Image: string to read with tesseract]

The code is just this line:

pytesseract.image_to_string(cv2.imread("img.png"), lang="eng")

Is there anything I can try in order to read these kinds of images?

Thanks in advance

Edit: [Image: second example]

Alberto Carmona asked Dec 15 '18 20:12



1 Answer

Thresholding the image before passing it to pytesseract can improve the accuracy, since it turns the grayscale image into pure black and white.

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Load the image as grayscale
img = Image.open('num.png').convert('L')
# Binarize: pixels above 125 become 255, the rest become 0
ret, img = cv2.threshold(np.array(img), 125, 255, cv2.THRESH_BINARY)

# Older versions of pytesseract need a pillow image
# Convert back if needed
img = Image.fromarray(img.astype(np.uint8))

print(pytesseract.image_to_string(img))

This printed out

5.78 / C02
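
For reference, the thresholding rule used above is simple: with cv2.THRESH_BINARY, every pixel strictly greater than the threshold becomes the maximum value and everything else becomes 0. Here is a minimal pure-Python sketch of that rule on a tiny made-up grid. The helper name binarize and the sample values are only for illustration and are not part of OpenCV.

```python
def binarize(gray, thresh=125, maxval=255):
    """Apply the THRESH_BINARY rule to a 2D list of grayscale values (0-255)."""
    # Pixels strictly above the threshold become maxval, the rest become 0.
    return [[maxval if px > thresh else 0 for px in row] for row in gray]


# Hypothetical 2x4 grayscale patch, values chosen only to show the rule.
sample = [
    [200, 130, 125, 40],
    [90, 126, 255, 0],
]
binary = binarize(sample)
for row in binary:
    print(row)
```

Note that a pixel exactly equal to the threshold (125 here) ends up black, so the threshold value may need adjusting per image.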

Edit: Thresholding alone on the second image returns 11.1. Another step that can help is setting the page segmentation mode to "Treat the image as a single text line" with the config --psm 7. With that, the second image returns 11.1 "202 ', where the quotation marks come from the partial text at the top of the image. To ignore those, you can also restrict which characters Tesseract looks for by passing a whitelist with the config -c tessedit_char_whitelist=0123456789.%. Everything together:

pytesseract.image_to_string(img, config='--psm 7 -c tessedit_char_whitelist=0123456789.%')

This returns 11.1 202. Clearly pytesseract is having a hard time with the percent symbol, and I'm not sure how to improve that with image processing or config changes.
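
If it helps, the config string can be assembled by a small helper, and the result can be filtered afterwards in case the installed Tesseract version does not honor the whitelist. This is only a rough sketch: build_config, keep_allowed and read_line are made-up names, and only pytesseract.image_to_string and the --psm / tessedit_char_whitelist options come from the answer above. pytesseract is imported inside read_line so the two string helpers can be tried without Tesseract installed.

```python
ALLOWED = "0123456789.%"


def build_config(psm=7, whitelist=ALLOWED):
    """Build the Tesseract config string used in the answer."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def keep_allowed(text, allowed=ALLOWED):
    """Drop characters outside the allowed set, keeping single spaces between tokens."""
    kept = "".join(ch for ch in text if ch in allowed or ch.isspace())
    return " ".join(kept.split())


def read_line(img, psm=7, whitelist=ALLOWED):
    """OCR a single text line and post-filter the result (hypothetical helper)."""
    import pytesseract  # needs the Tesseract binary to be installed

    raw = pytesseract.image_to_string(img, config=build_config(psm, whitelist))
    return keep_allowed(raw, whitelist)


print(build_config())
# Filtering the raw string reported above for the second image:
print(keep_allowed('11.1 "202 \''))
```

The post-filter only removes unwanted characters. It cannot recover a symbol that Tesseract misread, so it does not help with the percent sign itself.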

A Kruger answered Sep 28 '22 10:09