I'm trying to extract information from a form (scanned images of a form) and place that information into a table. I have used pytesseract to OCR the image with good success, but the problem with the output is the fact that Tesseract attempts to extract text line by line.
My scanned form looks like this:
Each window of the form (A, B, C) should be a different row in a table. I'm trying to use Open Computer Vision (in python) to identify the individual windows to 1) identify individual units of data (the A, B, C), 2) crop each individual window, and 3) Use Tesseract to OCR the image of the individual window to put the information where it needs to go in a SQL table.
My question: How can I identify the boundaries of each individual table entry window, and crop the image to only the extent of that boundary (to then apply OCR)? Also, is it possible to use corner detection to identify the individual units of data?
I am primarily using python with OpenCV, and am familiar enough with the documentation to apply a C#/++ OpenCV solution to a python script, so I would appreciate any information/alternative solutions you can provide.
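For reference, the OCR step I have so far is essentially a plain pytesseract call on the whole scanned page, roughly like this sketch (the file name is just a placeholder):

import pytesseract
from PIL import Image

# Rough sketch of the current whole-page OCR step (file name is a placeholder)
text = pytesseract.image_to_string(Image.open("scanned_form.png"))
print(text)  # Output comes back line by line, losing the per-window grouping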
It's possible to separate them section-wise using contours and simple contour properties alone.
Note: these steps will only work properly for this particular form. It's not a universal solution for all kinds of irregular forms, but you can implement or tweak certain methods to make it work for your form.
First read the image
image=cv2.imread("TDtma.png")
Convert it to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
Use the Canny edge detector to get the edges. The thresholds 600 and 1000 were chosen by experimentation; I chose them because they remove the background artifacts properly. You may need to tune these values depending on the images you are going to input.
edges = cv2.Canny(gray,600,1000)
Use a blur filter to remove minor artifacts that would be present in a real-world image (such as handwriting, etc.).
edges = cv2.GaussianBlur(edges,(5,5),0) # To remove small artifacting if any
Next we find the external contours: the 3 rectangles (sections) are visibly separated, so all we need to do is find the external contours. Note that this call is different for OpenCV 2.4.x.
(_,contours,_) = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
The contours happen to be detected from bottom to top, so we start with the character 'C' and decrement it down to 'A' just to label our regions of interest.
FormPart = ord('C')
Loop through each contour and crop the region of interest.
We check whether each contour has the right aspect ratio and area. These thresholds (aspect ratio > 2, area > 1000) were obtained through experimentation and may need to be changed for real-life input images. In our case a section rectangle should have an aspect ratio greater than 2 (one side of the rectangle is always longer than the other, and the rectangles in this image have a ratio > 2). We also check that the area is greater than 1000 to discard contours caused by small artifacts.
This particular image would be processed properly even without the area and aspect ratio checks, but real-world images may contain small blobs, so the checks are there to filter those out.
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    aspect_ratio = w / float(h)
    area = cv2.contourArea(contour)
    if aspect_ratio < 2 or area < 1000: # Skip any contour that doesn't meet our specifications
        continue
    crop_img = image[y:y+h, x:x+w] # This is our region of interest
    cv2.imshow("Split Section " + chr(FormPart), crop_img)
    cv2.waitKey(0)
    FormPart = FormPart - 1
    if FormPart < ord('A'): # Stop in case there are more than 3 sections
        break
Finally, here is the full program that you can copy, paste, and run on your machine. Make sure you have Python 2.7+ and OpenCV 3; a couple of lines need to be changed to work with OpenCV 2.4 (noted in the comments). Also make sure the image is named "TDtma.png" and is in the same directory as the Python program.
import cv2

image = cv2.imread("TDtma.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 600, 1000) # To remove the irrelevant edges and keep the relevant ones
cv2.imshow("Canny edge detection", edges)
cv2.waitKey(0)
edges = cv2.GaussianBlur(edges, (5, 5), 0) # To remove small artifacting if any
(_, contours, _) = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # Detecting external contours
# If you are on OpenCV 2.4.x use this instead
# (contours, _) = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
FormPart = ord('C') # Contours are detected from bottom to top in this example
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    aspect_ratio = w / float(h)
    area = cv2.contourArea(contour)
    if aspect_ratio < 2 or area < 1000: # Go to the next contour if this one doesn't meet our specifications
        continue
    crop_img = image[y:y+h, x:x+w] # This is our region of interest
    cv2.imshow("Split Section " + chr(FormPart), crop_img)
    cv2.waitKey(0)
    FormPart = FormPart - 1
    if FormPart < ord('A'): # Stop in case there are more than 3 sections
        break
And finally you should have something like this
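From there, to hook this back into the OCR/SQL part of your question, each crop_img can be passed to pytesseract and the result inserted into a table. Here is a minimal sketch, assuming a SQLite database and made-up names (forms.db, form_sections, store_section) that you would replace with your own schema:

import sqlite3

import cv2
import pytesseract

conn = sqlite3.connect("forms.db")  # placeholder database name
conn.execute("CREATE TABLE IF NOT EXISTS form_sections (section_label TEXT, section_text TEXT)")

def store_section(label, crop_img):
    # OCR one cropped section and insert the recognized text into the table
    text = pytesseract.image_to_string(cv2.cvtColor(crop_img, cv2.COLOR_BGR2RGB))
    conn.execute("INSERT INTO form_sections VALUES (?, ?)", (label, text))
    conn.commit()

# Inside the contour loop above, after crop_img is computed:
#     store_section(chr(FormPart), crop_img)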
It's possible to separate the individual data cells and text fields as well. It's a bit more complicated, though, and may not work reliably on real-world images. If you want, I can try. Leave a comment if you need it.
Hope I was able to help!
In this case, what you should do is take a look at OpenCV's findContours. Make sure to use the RETR_TREE retrieval mode to obtain a hierarchy of contours.
Your windows should be the highest level contours in your image. See my answer here to get an idea of how to navigate the hierarchy returned by OpenCV.
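A minimal sketch of that idea, assuming the OpenCV 3.x return signature (each hierarchy entry is [next, previous, first_child, parent], and a parent of -1 marks a highest-level contour):

import cv2

image = cv2.imread("TDtma.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Invert so the form boxes become white on black before contour detection
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

_, contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

for i, contour in enumerate(contours):
    if hierarchy[0][i][3] == -1:  # no parent -> highest-level contour (a candidate window)
        x, y, w, h = cv2.boundingRect(contour)
        window = image[y:y+h, x:x+w]
        # window can now be passed to pytesseract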