How can I train my Python-based Tesseract OCR to read different national identity cards?

Tags: python, tesseract

I am using Python to build an OCR system that reads ID cards and returns the exact text from the image, but the results are not correct: Tesseract misreads many characters. How can I train Tesseract so that it reads the ID card accurately and returns the right details? Furthermore, how do I get the .tiff file and make Tesseract work for my project?

asked Dec 13 '18 12:12

M A K

1 Answer

Steps to improve Pytesseract recognition:

Clean your image arrays so that only text remains (printed, not handwritten). The edges of the letters should be free of distortion. Apply a threshold (try different values) and some smoothing filters. I also recommend morphological opening/closing, but that is only a bonus. This is an exaggerated example of what the array passed to pytesseract should look like: https://i.ytimg.com/vi/1ns8tGgdpLY/maxresdefault.jpg
Resize the image containing the text you want to recognize to a higher resolution.
Pytesseract should generally recognize letters of any kind, but installing the font in which the text is written greatly increases accuracy.

How to install new fonts into pytesseract:

Get your desired font in TIFF format.
Upload it to http://trainyourtesseract.com/ and receive the trained data by email (EDIT: this site no longer exists. At the moment you have to find an alternative or train the font yourself).
Add the trained data file (*.traineddata) to this folder: C:\Program Files (x86)\Tesseract-OCR\tessdata
Add the language string to the pytesseract recognition function:

Let's say you have two trained fonts: font1.traineddata and font2.traineddata.
To use both, use this command:

txt = pytesseract.image_to_string(img, lang='font1+font2')
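
A missing .traineddata file only shows up as an error when Tesseract runs, so it can help to check the tessdata folder first. The sketch below assumes a hypothetical helper named `build_lang`; it is not part of pytesseract, it only verifies that the files exist and joins the names with `+`.

```python
from pathlib import Path


def build_lang(tessdata_dir, names):
    """Return a Tesseract lang string such as 'font1+font2'.

    Hypothetical helper: raises FileNotFoundError if any
    <name>.traineddata file is missing from tessdata_dir.
    """
    tessdata = Path(tessdata_dir)
    missing = [n for n in names if not (tessdata / f"{n}.traineddata").is_file()]
    if missing:
        raise FileNotFoundError(f"missing traineddata for: {', '.join(missing)}")
    return "+".join(names)


# Usage (needs pytesseract, a Tesseract install, and an image array `img`):
# lang = build_lang(r"C:\Program Files (x86)\Tesseract-OCR\tessdata", ["font1", "font2"])
# txt = pytesseract.image_to_string(img, lang=lang)
```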

Here is code to test your recognition on web images:

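import urllib.request

import cv2
import numpy as np
import pytesseract

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
# Note: Tesseract reads TESSDATA_PREFIX from the environment; assigning it to a
# plain Python variable has no effect, so set it in the environment if needed.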

def url_to_image(url):
    resp = urllib.request.urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

url='http://jeroen.github.io/images/testocr.png'


img = url_to_image(url)


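# Smooth the image, then binarize it with a fixed threshold (try different values)
# img = cv2.GaussianBlur(img, (5, 5), 0)  # alternative smoothing filter
img = cv2.medianBlur(img, 5)
retval, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)
txt = pytesseract.image_to_string(img, lang='eng')
print('recognition:', txt)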
>>> txt
'This ts a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format\n\nThe quick brown dog jumped over the\nlazy fox The quick brown dog jumped\nover the lazy fox The quick brown dog\njumped over the lazy fox The quick\nbrown dog jumped over the lazy fox'
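
When trying different threshold or blur values, it may help to score each result against text you already know to be correct instead of judging by eye. The following is a rough sketch using only the standard library; the helper name `similarity` is made up for this example, and the ground-truth string is an assumption about what the test image says (the output above reads "ts" where "is" is presumably meant).

```python
from difflib import SequenceMatcher


def similarity(ocr_text, expected_text):
    """Return a similarity ratio between 0 and 1 after collapsing whitespace."""
    a = " ".join(ocr_text.split())
    b = " ".join(expected_text.split())
    return SequenceMatcher(None, a, b).ratio()


# First lines of the OCR output above versus the assumed ground truth
ocr = "This ts a lot of 12 point text to test the\nocr code"
truth = "This is a lot of 12 point text to test the ocr code"
score = similarity(ocr, truth)
print(f"similarity: {score:.3f}")
```

Running the same comparison for each candidate threshold gives a simple way to pick the setting that produces the fewest misread characters on a sample card.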

answered Sep 26 '22 18:09

Martin
