Python tesseract increase accuracy for OCR

Question

I have fairly simple images, but Tesseract does not return the correct text.

Code:

pytesseract.image_to_string(image, lang='eng')

The example picture gives this result:

SARVN PRIM E N EU ROPTICS
BLU EPRINT

I also tried adding my own words to the dictionary to see whether that would help, but the result did not improve.

pytesseract.image_to_string(image, lang='eng', config="--user-words words.txt")

My word list looks like this:

SARYN
PRIME
NEUROPTICS
BLUEPRINT

How should I approach the problem? Maybe I have to convert the image before running OCR? The text color can vary between a couple of colors, but the background is always black.

Hussam Barouqa · Accepted Answer

Try inverting the image and then applying binarization/thresholding to get black text on a white background before running OCR.

See this post for tips on the binarization of an image in Python.

Of course, the better the quality and the sharper the text in the input image, the better your OCR results will be.

I used an external tool to convert it to black on white and got the image below.

Inverted and Binarized

Ahx · Answer

I have a four-step solution:

1. Smooth the image
2. Apply a simple threshold
3. Take the text line by line
4. Apply erosion to each individual line

Step	Result
Smoothing	(image)
Threshold	(image)
Upsample + Erode	(image)
Pytesseract	SARYN PRIME NEUVROPTICS BLUEPRINT

Code:

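import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread('j0nNV.png')
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Step 1: smooth with a small Gaussian kernel
blr = cv2.GaussianBlur(gry, (3, 3), 0)

# Step 2: threshold (Otsu's method chooses the threshold value automatically)
thr = cv2.threshold(blr, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
(h_thr, w_thr) = thr.shape[:2]
s_idx = 0
e_idx = int(h_thr/2)

# Step 3: the image holds two text lines, so take the top half, then the bottom half
for _ in range(0, 2):
    crp = thr[s_idx:e_idx, 0:w_thr]
    (h_crp, w_crp) = crp.shape[:2]
    # Step 4: upsample the crop to twice its size, then erode it
    crp = cv2.resize(crp, (w_crp*2, h_crp*2))
    crp = cv2.erode(crp, None, iterations=1)
    # Move the row window to the next half for the following iteration
    s_idx = e_idx
    e_idx = s_idx + int(h_thr/2)
    txt = pytesseract.image_to_string(crp)
    print(txt)
    cv2.imshow("crp", crp)
    cv2.waitKey(0)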
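
The code above splits the image into two equal halves, which fits this particular picture. For images where the lines sit elsewhere, one possible variation (a sketch, not part of the original answer) is to find the rows that contain text pixels. The function name find_text_rows is hypothetical, and it assumes the thresholded image has white (255) text on a black background, as thr does here.

```python
import numpy as np


def find_text_rows(binary, text_value=255):
    """Return (start, end) row ranges that contain at least one text pixel.

    binary: 2-D uint8 array produced by thresholding.
    text_value: pixel value of the text (255 is an assumption: white on black).
    Each range is half-open, so binary[start:end] selects one band.
    """
    has_text = (binary == text_value).any(axis=1)
    bands = []
    start = None
    for y, flag in enumerate(has_text):
        if flag and start is None:
            start = y                   # a band of text rows begins
        elif not flag and start is not None:
            bands.append((start, y))    # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(has_text)))  # band touches the bottom edge
    return bands
```

Each (start, end) pair could replace the fixed s_idx and e_idx, for example crp = thr[start:end, 0:w_thr]. Isolated noise pixels would show up as extra one-row bands, so filtering out very short bands or adding a few rows of padding around each crop may be needed in practice.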

Tags: python, machine-learning, ocr, tesseract, python-tesseract

Jaanus
