Pytesseract foreign language extraction using python

I am using Python 2.7, Pytesseract-0.1.7 and Tesseract-ocr 3.05.01 on a Windows machine.

I tried extracting Korean and Russian text, and I am fairly sure the extraction itself works.

Now I need to compare the string extracted from the image with an expected string.

The comparison never gives the correct result; it always prints "Not Match".

Here is my code:

# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract
import argparse
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True, help="path to the image")
args = vars(ap.parse_args())
img = Image.open(args["input"])
img.load()
text = pytesseract.image_to_string(img)
print(text)
text = text.encode('ascii')
print(text)
i = 'Сред. Скорость'
print i
if ( text == i):
    print "Match"
else :
    print "Not Match"

The image used to extract text is attached.

Now I need a way to make the comparison work. I also need to know whether the string returned by pytesseract is Unicode or something else, and whether there is a way to convert it to Unicode (like the option WordPad has for converting characters to Unicode).

Russian text image

Deepan Raj asked Jun 22 '17 06:06


1 Answer

You are using Tesseract with a language other than English, so first of all make sure that the trained data for your language is installed, as shown here (Linux instructions only).
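One way to check which languages are installed is to ask the Tesseract command-line tool itself with `tesseract --list-langs`. The following is only a sketch: the helper names `parse_list_langs` and `installed_languages` are made up for illustration, it assumes the `tesseract` executable is on your PATH, and it reads both stdout and stderr because the stream used for the list seems to differ between releases (so stray warnings could end up in the result).

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages(tesseract_cmd="tesseract"):
    """Run the Tesseract CLI and return the language codes it reports."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Assumption: the list may arrive on either stream, so parse both.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    try:
        print("rus installed:", "rus" in installed_languages())
    except OSError:
        print("tesseract executable not found on PATH")
```

If `rus` is missing from the list, passing `lang="rus"` to pytesseract will not work until the Russian trained data is installed.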

Secondly, I strongly suggest switching to Python 3 if you are working with non-ASCII languages (as I do, as a Slovenian). Python 3 works with Unicode out of the box, so it saves you a lot of pain with encoding and decoding strings.
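To see why the `text.encode('ascii')` call in the question cannot work for Russian text, here is a small Python 3 sketch that uses only the standard library. It assumes the OCR result arrives as an ordinary `str`; the variable names are just for illustration.

```python
expected = "Сред. Скорость"  # a Python 3 str holds Unicode code points

# ASCII has no Cyrillic letters, so encoding to it raises an error.
try:
    expected.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent the text; decoding the bytes gives back an equal str.
raw = expected.encode("utf-8")
round_trip = raw.decode("utf-8")

print(type(expected).__name__, type(raw).__name__)  # str bytes
print(ascii_ok)                # False
print(round_trip == expected)  # True
print(raw == expected)         # False: bytes never compare equal to str
```

The last line is the important one for comparisons: keep both sides as `str` and compare those, rather than comparing encoded bytes with text.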

# Python 3 is required!
from PIL import Image
import pytesseract

img = Image.open("T9esw.png")
img.load()
text = pytesseract.image_to_string(img, lang="rus")  # specify the language to look for
print(text)
i = 'Сред. Скорость'
print(i)
if text == i:
    print("Match")
else:
    print("Not Match")

Which outputs:

Фред скорасть
Сред. Скорость
Not Match

This means the strings did not quite match, but considering the minimal coding effort and the awful quality of the input image, I think the performance is quite amazing. In any case, the example shows that encoding and decoding should no longer be a problem.
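Since OCR output like this is rarely character-for-character identical to the expected string, an exact `==` test may be too strict. As a possible workaround (not part of the code above), you could normalize both strings and compare them approximately with the standard library's `difflib`. The helper names and the 0.8 threshold below are arbitrary choices for illustration and would need tuning for real data.

```python
import difflib
import unicodedata


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a, b):
    """Return a score between 0.0 and 1.0; 1.0 means identical after normalizing."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


ocr_text = "Фред скорасть"   # OCR output shown above
expected = "Сред. Скорость"  # string we hoped to find

THRESHOLD = 0.8  # assumption: picked by hand, not a recommended value
score = similarity(ocr_text, expected)
print("Match" if score >= THRESHOLD else "Not Match")
```

A lower threshold tolerates more OCR mistakes but also raises the chance of accepting the wrong text, so it is worth checking against a few known images first.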

Marjan Moderc answered Oct 10 '22 02:10