What causes pytesseract to read either the top or bottom text-line of a dual-line image depending on whether opencv or pillow is used?

Question

EDIT: I had forgotten to preprocess the image; doing so fixes the reading issue (thanks to nathancy). I am still wondering what makes Tesseract read only the top or only the bottom line of an unprocessed image (same image, two different outcomes).

Original:
I have an image that contains two lines of text (image: random test image for pytesseract).

When I open the image in Python (IDLE, Python 3.6) with PIL's Image and use pytesseract to extract a string, only the last (bottom) line is extracted correctly. The upper line of text comes out as scrambled garbage (see the code section below).
However, when I open the image with OpenCV and use pytesseract to extract a string, only the top line is extracted correctly, and the bottom line is garbled (see also the code section below).

Here is the code:

>>> from PIL import Image, ImageFilter
>>> import pytesseract
>>> pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
>>> import cv2

>>> img = Image.open(r"C:\Users\user\MyImage.png")
>>> img2 = cv2.imread(r"C:\Users\user\MyImage.png", cv2.IMREAD_COLOR)


>>> print(pytesseract.image_to_string(img2))
Pet Sock has 448/600 HP left
A ae eee PER eats ae

>>> print(pytesseract.image_to_string(img))
Le TL
JHE has 329/350 HP left.

When I use pytesseract.image_to_boxes on both img and img2, some locations get the same bounding box but a different letter (only the two output lines that share an identical box are shown here):

>>> print(pytesseract.image_to_boxes(img2))
A 4 6 10 16 0

>>> print(pytesseract.image_to_boxes(img))
J 4 6 10 16 0

When I use pytesseract.image_to_data on both img and img2, it shows very high confidence (95 or above) on the line it reads correctly and very low confidence (30 or below) on the garbled line.
Excel table output of image_to_data
Edit: the Excel tables are for img2 and img, respectively.
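
The per-line confidence comparison described above can also be done without a spreadsheet. The following is a minimal sketch, assuming the default string output of image_to_data (tab-separated, with a header row that includes block_num, par_num, line_num and conf). The function name mean_confidence_per_line and the sample rows are made up for illustration and are not taken from the original tables.

```python
import csv
import io
from collections import defaultdict


def mean_confidence_per_line(tsv_text):
    """Average word confidence per (block, paragraph, line) of image_to_data TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    scores = defaultdict(list)
    for row in reader:
        raw = row.get("conf")
        if raw is None or raw == "":
            continue  # malformed or short row
        conf = float(raw)
        if conf < 0:
            continue  # -1 marks rows that are not recognised words
        key = (int(row["block_num"]), int(row["par_num"]), int(row["line_num"]))
        scores[key].append(conf)
    return {key: sum(values) / len(values) for key, values in scores.items()}


# Synthetic sample in the same column layout (values are invented).
HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]
ROWS = [
    ["4", "1", "1", "1", "1", "0", "0", "0", "100", "20", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "0", "0", "40", "20", "96", "first"],
    ["5", "1", "1", "1", "1", "2", "50", "0", "40", "20", "95", "line"],
    ["4", "1", "1", "1", "2", "0", "0", "30", "100", "20", "-1", ""],
    ["5", "1", "1", "1", "2", "1", "0", "30", "40", "20", "20", "A"],
    ["5", "1", "1", "1", "2", "2", "50", "30", "40", "20", "30", "ae"],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)

per_line = mean_confidence_per_line(SAMPLE_TSV)
for key, value in sorted(per_line.items()):
    print(key, round(value, 1))
```

With a real image, the same function could be called as mean_confidence_per_line(pytesseract.image_to_data(img)) and again with img2 to compare the two loads line by line.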

I tried every psm config value. Settings 5, 7, 8, 9, 10 and 13 produced more garbage, and settings 0 and 2 raised an error; the remaining settings gave the same results as the default (which is 3, I believe).

I must be making some rookie mistake, but I can't work out why this is happening. If anyone can shed some light on this, it would be awesome.

The image was just a fitting, but random, image for an OCR test that I had lying around. I have no intention beyond experimenting with pytesseract.

nathancy · Accepted Answer

Whenever you perform OCR with Pytesseract, it is important to preprocess the image so that the text is black and the background is white. We can do this with simple thresholding.

Output from Pytesseract

Pet Sock has 448/600 HP left
JHE has 329/359 HP left.

Code

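import cv2
import pytesseract

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image, convert it to grayscale, then apply an inverted Otsu
# threshold, aiming for black text on a white background
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# --psm 6: treat the image as a single uniform block of text
data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
print(data)

# Show the thresholded image; press any key to close the window
cv2.imshow('thresh', thresh)
cv2.waitKey()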
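
On the follow-up question of why the unprocessed image gives two different results, one likely contributor is channel order. cv2.imread returns pixels in BGR order, while PIL's Image.open yields RGB, and pytesseract assumes RGB, so the array from OpenCV reaches Tesseract with red and blue swapped. If the two text lines have different colours, the swap can change which line contrasts most with the background after grayscale conversion. The sketch below illustrates the effect only: the colours, the Rec. 601 luma weights and the helper names are assumptions for illustration, not values taken from the image or from Tesseract's internals.

```python
# Illustrative only: the colours and luma weights below are assumptions,
# not measurements from the question's image or Tesseract's own conversion.

def luma(rgb):
    """Rec. 601 grayscale value of a pixel given as (R, G, B)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def swap_red_blue(pixel):
    """Swap the first and last channel, as when BGR data is read as RGB."""
    first, green, last = pixel
    return (last, green, first)


def contrast(text_rgb, background_rgb):
    """Absolute grayscale difference between text and background."""
    return abs(luma(text_rgb) - luma(background_rgb))


# Hypothetical colours for the two text lines and the background.
top_line = (220, 60, 40)      # reddish
bottom_line = (40, 60, 220)   # bluish
background = (30, 30, 30)     # dark grey, unchanged by the swap

# Contrast when the channels are read in the intended order.
as_rgb = {
    "top": contrast(top_line, background),
    "bottom": contrast(bottom_line, background),
}

# Contrast when red and blue are swapped (a BGR array treated as RGB).
as_swapped = {
    "top": contrast(swap_red_blue(top_line), swap_red_blue(background)),
    "bottom": contrast(swap_red_blue(bottom_line), swap_red_blue(background)),
}

print(as_rgb)
print(as_swapped)
```

With these made-up colours, the line with the stronger contrast flips when the channels are swapped. To check whether this is what happens with the real image, convert before OCR, for example pytesseract.image_to_string(cv2.cvtColor(img2, cv2.COLOR_BGR2RGB)), and compare the result with the PIL one. Other differences between the two loads, such as an alpha channel, may also play a part.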

Tags: python, opencv, python-imaging-library, ocr, python-tesseract, non-english-programmer
