I have a batch of dates that I'm trying to OCR with Tesseract. However, many of the digits merge with the lines of the date boxes, like this:
Here is a good image that Tesseract handles well:
And here's my code:
    import os
    import cv2
    from matplotlib import pyplot as plt
    import subprocess
    import numpy as np
    from PIL import Image

    def show(img):
        plt.figure(figsize=(20,20))
        plt.imshow(img,cmap='gray')
        plt.show()

    def sort_contours(cnts, method="left-to-right"):
        # initialize the reverse flag and sort index
        reverse = False
        i = 0
        # handle if we need to sort in reverse
        if method == "right-to-left" or method == "bottom-to-top":
            reverse = True
        # handle if we are sorting against the y-coordinate rather than
        # the x-coordinate of the bounding box
        if method == "top-to-bottom" or method == "bottom-to-top":
            i = 1
        # construct the list of bounding boxes and sort them by the
        # chosen coordinate
        boundingBoxes = [cv2.boundingRect(c) for c in cnts]
        cnts, boundingBoxes = zip(*sorted(zip(cnts, boundingBoxes),
                                          key=lambda b:b[1][i], reverse=reverse))
        # return the list of sorted contours and bounding boxes
        return cnts, boundingBoxes

    def tesseract_it(contours,main_img, label,delete_last_contour=False):
        min_limit, max_limit = (1300,1700)
        idx =0
        roi_list = []
        slist= set()
        for cnt in contours:
            idx += 1
            x,y,w,h = cv2.boundingRect(cnt)
            if label=='boxes':
                # trim 2px on each side to drop the box border
                roi=main_img[y+2:y+h-2,x+2:x+w-2]
            else:
                roi=main_img[y:y+h,x:x+w]
            if w*h > min_limit and w*h < max_limit and w>10 and w< 50 and h>10 and h<50:
                if (x,y,w,h) not in slist: # Stops from identifying repeated contours
                    roi = cv2.resize(roi,dsize=(45,45),fx=0 ,fy=0, interpolation = cv2.INTER_AREA)
                    roi_list.append(roi)
                    slist.add((x,y,w,h))
        if not delete_last_contour:
            vis = np.concatenate((roi_list),1)
        else:
            roi_list.pop(-1)
            vis = np.concatenate((roi_list),1)
        show(vis)
        # Tesseract the final image here
        # ...

    image = 'bad_digit/1.jpg'
    # image = 'bad_digit/good.jpg'
    specimen_orig = cv2.imread(image,0)
    specimen = cv2.fastNlMeansDenoising(specimen_orig)
    # show(specimen)
    kernel = np.ones((3,3), np.uint8)
    # Now we erode
    specimen = cv2.erode(specimen, kernel, iterations = 1)
    # show(specimen)
    _, specimen = cv2.threshold(specimen, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # show(specimen)
    specimen_canny = cv2.Canny(specimen, 0, 0)
    # show(specimen_canny)
    specimen_blank_image = np.zeros((specimen.shape[0], specimen.shape[1], 3))
    # [-2] picks the contour list under both OpenCV 3 (three return values)
    # and OpenCV 4 (two return values)
    specimen_contours = cv2.findContours(specimen_canny.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)[-2]
    # print(len(specimen_contours))
    cv2.drawContours(specimen_blank_image, specimen_contours, -1, 100, 2)
    # show(specimen_blank_image)
    specimen_blank_image = np.zeros((specimen.shape[0], specimen.shape[1], 3))
    specimen_sorted_contours, specimen_bounding_box = sort_contours(specimen_contours)
    output_string = tesseract_it(specimen_sorted_contours,specimen_orig,label='boxes',)
    # return output_string
The output for the good image looks like this:
Running Tesseract on this image gives me accurate results.
However, for the images where the lines merge into the digits, my output looks like this:
These do not work well with Tesseract at all. Is there a way to remove the lines and keep only the digits?
I have also tried the morphological line-detection approach from this OpenCV tutorial: https://docs.opencv.org/3.2.0/d1/dee/tutorial_moprh_lines_detection.html
It doesn't do well on the images I've attached.
I've also tried ImageMagick:
convert original.jpg \
\( -clone 0 -threshold 50% -negate -statistic median 200x1 \) \
-compose lighten -composite \
\( -clone 0 -threshold 50% -negate -statistic median 1x200 \) \
-composite output.jpg
Its results are fair, but the line removal cuts through some of the digits, as shown here:
Is there a better way to approach this problem? My final goal is to run Tesseract on the digits, so the final image needs to be quite clear.
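To make the line-removal idea concrete, here is a minimal NumPy sketch. It is not code from this thread: the function names, `min_len` and `reach` are all made up for illustration. It marks long horizontal and vertical runs of ink as box lines (roughly what a morphological opening with a long, thin kernel does), erases them, and then puts back line pixels that have digit ink on both sides, so a stroke crossing a line is not cut in two. It assumes a boolean image where True means ink, and thin lines; thicker lines would need a larger `reach`.

```python
import numpy as np

def long_runs(ink, min_len):
    """Mark pixels that belong to a horizontal run of at least min_len ink pixels."""
    out = np.zeros_like(ink, dtype=bool)
    for y, row in enumerate(ink):
        # Pad with False so every run has both a start and an end transition.
        padded = np.concatenate(([False], row, [False]))
        edges = np.flatnonzero(padded[1:] != padded[:-1])
        for start, stop in zip(edges[::2], edges[1::2]):
            if stop - start >= min_len:
                out[y, start:stop] = True
    return out

def shifted(mask, dy, dx):
    """Return mask moved by (dy, dx) pixels, filling vacated cells with False."""
    h, w = mask.shape
    ay, ax = abs(dy), abs(dx)
    padded = np.pad(mask, ((ay, ay), (ax, ax)))
    return padded[ay - dy:ay - dy + h, ax - dx:ax - dx + w]

def remove_lines(ink, min_len, reach=2):
    """Erase long straight runs, keeping pixels where a stroke crosses them."""
    horizontal = long_runs(ink, min_len)
    vertical = long_runs(ink.T, min_len).T
    strokes = ink & ~(horizontal | vertical)

    def near(dy, dx):
        # True where stroke ink lies within `reach` pixels in one direction.
        hits = np.zeros_like(ink)
        for step in range(1, reach + 1):
            hits |= shifted(strokes, dy * step, dx * step)
        return hits

    keep_h = horizontal & near(1, 0) & near(-1, 0)   # ink above and below
    keep_v = vertical & near(0, 1) & near(0, -1)     # ink left and right
    return strokes | keep_h | keep_v
```

A stroke that only touches a line from one side still loses the overlapping pixels, so treat this as a starting point rather than a fix.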
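Here is some code that seems to work quite well. There are two phases: first a horizontal dilate followed by a slight erode, which washes out most of the thin box lines, and then a contour pass that crops the image to the bounding rectangle of the digits.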
Here is one image's result after 1st phase:
And here are all results after 2nd phase:
As you can see, it's not perfect: an 8 can be read as a B (even a human like me sees it as a B, but that is easily fixed if you only have digits in your world). There is also a ":"-like character (a leftover from a vertical line that has been removed) that I couldn't get rid of without tweaking the code too much.
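If the fields only ever contain digits, the B-for-8 and stray ":" problems can be handled after OCR. Here is a minimal sketch; the look-alike table is a guess, not something measured on these images, so adjust it to the confusions your OCR actually produces.

```python
# Hypothetical look-alike table; extend it for the confusions you observe.
LOOKALIKES = str.maketrans({
    "B": "8", "O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "Z": "2",
})

def digits_only(ocr_text):
    """Map look-alike letters to digits, then drop everything that is not 0-9."""
    mapped = ocr_text.translate(LOOKALIKES)
    return "".join(ch for ch in mapped if ch in "0123456789")

print(digits_only("02:03201B"))  # 02032018
```

Tesseract also has a `tessedit_char_whitelist` setting that restricts recognition to a given character set, which may make part of this unnecessary depending on your version and engine mode.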
The C# code:
    static void Unbox(string inputFilePath, string outputFilePath)
    {
        using (var orig = new Mat(inputFilePath))
        {
            using (var gray = orig.CvtColor(ColorConversionCodes.BGR2GRAY))
            {
                using (var dst = orig.EmptyClone())
                {
                    // this is what I call the "horizontal shake" pass.
                    // note I use the Rect shape here, this is important
                    using (var dilate = Cv2.GetStructuringElement(MorphShapes.Rect, new Size(4, 1)))
                    {
                        Cv2.Dilate(gray, dst, dilate);
                    }

                    // erode just a bit to get back some numbers to life
                    using (var erode = Cv2.GetStructuringElement(MorphShapes.Rect, new Size(2, 1)))
                    {
                        Cv2.Erode(dst, dst, erode);
                    }

                    // at this point, good OCR will see most numbers
                    // but we want to remove surrounding artifacts

                    // find contours
                    using (var canny = dst.Canny(0, 400))
                    {
                        var contours = canny.FindContoursAsArray(RetrievalModes.List, ContourApproximationModes.ApproxSimple);

                        // compute a bounding rect for all numbers w/o boxes and artifacts
                        // this is the tricky part where we try to discard what's not related exclusively to numbers
                        var boundingRect = Rect.Empty;
                        foreach (var contour in contours)
                        {
                            // discard some small and broken polygons
                            var polygon = Cv2.ApproxPolyDP(contour, 4, true);
                            if (polygon.Length < 3)
                                continue;

                            // we want only numbers, and boxes are approx 40px wide,
                            // so let's discard box-related polygons, if any
                            // and some other artifacts that passed previous checks
                            // this quite depends on some context knowledge...
                            var rect = Cv2.BoundingRect(polygon);
                            if (rect.Width > 40 || rect.Height < 15)
                                continue;

                            boundingRect = boundingRect.X == 0 ? rect : boundingRect.Union(rect);
                        }

                        using (var final = dst.Clone(boundingRect))
                        {
                            final.SaveImage(outputFilePath);
                        }
                    }
                }
            }
        }
    }
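For anyone following along with the Python code from the question, the first phase (the "horizontal shake") can be approximated in plain NumPy as below. This is a sketch, not an exact port of the C# above: the function names are made up, the border handling (edge replication) differs from OpenCV's default, and it assumes a grayscale image with dark digits on a light background. On such an image a dilate is a sliding maximum, which brightens away dark features narrower than the window, and an erode is a sliding minimum.

```python
import numpy as np

def horizontal_filter(gray, width, reducer):
    """Apply reducer (np.max or np.min) over a horizontal window of `width` pixels."""
    left = width // 2
    right = width - 1 - left
    padded = np.pad(gray, ((0, 0), (left, right)), mode="edge")
    # One shifted copy of the image per position inside the window.
    windows = np.stack([padded[:, k:k + gray.shape[1]] for k in range(width)])
    return reducer(windows, axis=0)

def horizontal_shake(gray, dilate_width=4, erode_width=2):
    """Dilate with a 4x1 window, then erode with a 2x1 window."""
    dilated = horizontal_filter(gray, dilate_width, np.max)  # thin dark lines vanish
    return horizontal_filter(dilated, erode_width, np.min)   # digits regain some width
```

With OpenCV available, `cv2.dilate` and `cv2.erode` with rectangular structuring elements of the same sizes are the natural choice; the NumPy version is only meant to show what the two passes do to the pixels.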
Just a suggestion; I have never tried it.
Instead of trying to remove the bars, keep them and train on all possible bar positions. Trim the bars to the character limits for proper alignment.
Train these as 02032018022018. I guess it is better to simulate the bars on clean characters.
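Simulating the bars could look something like the sketch below. Everything here is an assumption for illustration: the function name, the bar thickness, and the idea that each clean character is a 2-D grayscale NumPy array with dark ink (0) on a light background. It yields one copy of the glyph per possible vertical and horizontal bar position, which you would then label with the same digit as the clean glyph.

```python
import numpy as np

def bar_variants(glyph, thickness=2, ink=0):
    """Yield ((orientation, offset), image) for every bar position on a clean glyph."""
    height, width = glyph.shape
    for x in range(width - thickness + 1):
        sample = glyph.copy()
        sample[:, x:x + thickness] = ink      # vertical bar, full glyph height
        yield ("vertical", x), sample
    for y in range(height - thickness + 1):
        sample = glyph.copy()
        sample[y:y + thickness, :] = ink      # horizontal bar, full glyph width
        yield ("horizontal", y), sample
```

Because each bar spans exactly the glyph's own height or width, the bars are already trimmed to the character limits, as suggested above.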