Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python OCR: ignore signatures in documents

I'm trying to do OCR of a scanned document which has handwritten signatures in it. See the image below.

enter image description here

My question is simple, is there a way to still extract the names of the people using OCR while ignoring the signatures? When I run Tesseract OCR it fails to retrieve the names. I tried grayscaling/blurring/thresholding, using the code below, but without luck. Any suggestions?

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
image = cv2.GaussianBlur(image, (5, 5), 0)
image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
like image 384
vdvaxel Avatar asked Aug 25 '18 19:08

vdvaxel


3 Answers

You can use scikit-image's Gaussian filter to blur thin lines first (with an appropriate sigma), followed by binarization of image (e.g., with some thresholding function), then by morphological operations (such as remove_small_objects or opening with some appropriate structure), to remove the signatures mostly and then try classification of the digits with sliding window (assuming that one is already trained with some blurred characters as in the test image). The following shows an example.

from skimage.morphology import binary_opening, square
from skimage.filters import threshold_minimum
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.filters import gaussian

im = gaussian(rgb2gray(imread('lettersig.jpg')), sigma=2)
thresh = threshold_minimum(im)
im = im > thresh
im = im.astype(np.bool)
plt.figure(figsize=(20,20))
im1 = binary_opening(im, square(3))
plt.imshow(im1)
plt.axis('off')
plt.show()    

enter image description here

[EDIT]: Use Deep Learning Models

Another option is to pose the problem as an object detection problem where the alphabets are objects. We can use deep learning: CNN/RNN/Fast RNN models (with tensorflow/keras) for object detection or Yolo model (refer to the this article for car detection with yolo model).

like image 159
Sandipan Dey Avatar answered Nov 02 '22 16:11

Sandipan Dey


I suppose the input pictures are grayscale, otherwise maybe the different color of the ink could have a distinctive power.

The problem here is that, your training set - I guess - contains almost only 'normal' letters, without the disturbance of the signature - so naturally the classifier won't work on letters with the ink of signature on them. One way to go could be to extend the training set with letters of this type. Of course it is quite a job to extract and label these letters one-by-one.

You can use real letters with different signatures on them, but it might be also possible to artificially generate similar letters. You just need different letters with different snippets of signatures moved above them. This process might be automated.

like image 1
Lricsi Avatar answered Nov 02 '22 17:11

Lricsi


You may try to preprocess the image with morphologic operations.

You can try opening to remove the thin lines of the signature. The problem is that it may remove the punctuation as well.

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS,(5,5))
image = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

You may have to alter the kernel size or shape. Just try different sets.

like image 1
Mailerdaimon Avatar answered Nov 02 '22 17:11

Mailerdaimon