Remove background noise from image to make text more clear for OCR

Tags: java, c++, opencv, ocr

I've written an application that segments an image based on the text regions within it, and extracts those regions as I see fit. What I'm attempting to do is clean the image so OCR (Tesseract) gives an accurate result. I have the following image as an example:

enter image description here

Running this through Tesseract gives a wildly inaccurate result. However, after cleaning up the image in Photoshop to get the following:

enter image description here

Tesseract gives exactly the result I would expect. The first image has already been run through the following method to clean it to that point:

    public Mat cleanImage (Mat srcImage) {
        Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
        Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_OTSU);
        Imgproc.erode(srcImage, srcImage, new Mat());
        Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(0, 0), 9);
        return srcImage;
    }

What more can I do to clean the first image so it resembles the second image?

Edit: This is the original image before it's run through the cleanImage function.

enter image description here

asked Nov 23 '15 21:11 by Zy0n



1 Answer

My answer is based on the following assumptions. It's possible that neither of them holds in your case.

  • You can impose a threshold on the bounding-box heights in the segmented region. That lets you filter out the other components.
  • You know the average stroke width of the digits. Use this information to minimize the chance that the digits are connected to other regions. You can use the distance transform and morphological operations for this.
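To illustrate the second assumption, here is a minimal pure-Java sketch (no OpenCV) of why thresholding the distance transform at half the stroke width drops thin strokes. The class and method names are made up for this example, and the brute-force distance computation is only meant for tiny grids; OpenCV's distanceTransform does the same job far more efficiently.

```java
final class StrokeWidthSketch {
    /**
     * Brute-force Euclidean distance transform: for every foreground pixel,
     * the distance to the nearest background pixel. Background stays 0.
     */
    static double[][] distanceTransform(boolean[][] fg) {
        int h = fg.length, w = fg[0].length;
        double[][] dist = new double[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!fg[y][x]) continue;
                double best = Double.POSITIVE_INFINITY;
                for (int by = 0; by < h; by++) {
                    for (int bx = 0; bx < w; bx++) {
                        if (!fg[by][bx]) {
                            best = Math.min(best, Math.hypot(x - bx, y - by));
                        }
                    }
                }
                dist[y][x] = best;
            }
        }
        return dist;
    }

    /** Keeps only pixels whose distance to the background exceeds strokeWidth / 2. */
    static boolean[][] keepThick(boolean[][] fg, double strokeWidth) {
        double[][] dist = distanceTransform(fg);
        boolean[][] out = new boolean[fg.length][fg[0].length];
        for (int y = 0; y < fg.length; y++) {
            for (int x = 0; x < fg[0].length; x++) {
                out[y][x] = dist[y][x] > strokeWidth / 2.0;
            }
        }
        return out;
    }
}
```

With a one-pixel-wide line and a five-pixel-wide bar in the same grid, every pixel of the line has distance 1 and is dropped at a stroke width of 4, while the core of the bar survives. The surviving region is thinner than the original stroke, which is worth keeping in mind when you later build a mask from it.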

This is my procedure for extracting the digits:

  • Apply the Otsu threshold to the image (image: otsu)
  • Take the distance transform (image: dist)
  • Threshold the distance-transformed image using the stroke-width (= 8) constraint (image: sw2)

  • Apply a morphological opening to disconnect the digits from stray regions (image: ws2op)

  • Filter by bounding-box height and make a guess where the digits are

stroke-width = 8 (image: bb)

stroke-width = 10 (image: bb2)
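The last step of the procedure, filtering by bounding-box height, can be sketched in plain Java as follows. This is only an illustration on a boolean grid: it labels 8-connected components with a flood fill instead of calling findContours, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class HeightFilterSketch {
    /** Bounding box {minX, minY, maxX, maxY} of each 8-connected foreground component. */
    static List<int[]> componentBoxes(boolean[][] fg) {
        int h = fg.length, w = fg[0].length;
        boolean[][] seen = new boolean[h][w];
        List<int[]> boxes = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!fg[y][x] || seen[y][x]) continue;
                int[] box = {x, y, x, y};
                ArrayDeque<int[]> queue = new ArrayDeque<>();
                queue.add(new int[] {x, y});
                seen[y][x] = true;
                while (!queue.isEmpty()) {
                    int[] p = queue.poll();
                    box[0] = Math.min(box[0], p[0]);
                    box[1] = Math.min(box[1], p[1]);
                    box[2] = Math.max(box[2], p[0]);
                    box[3] = Math.max(box[3], p[1]);
                    // visit the 8 neighbours of p
                    for (int dy = -1; dy <= 1; dy++) {
                        for (int dx = -1; dx <= 1; dx++) {
                            int nx = p[0] + dx, ny = p[1] + dy;
                            if (nx >= 0 && nx < w && ny >= 0 && ny < h
                                    && fg[ny][nx] && !seen[ny][nx]) {
                                seen[ny][nx] = true;
                                queue.add(new int[] {nx, ny});
                            }
                        }
                    }
                }
                boxes.add(box);
            }
        }
        return boxes;
    }

    /** Keeps only the boxes taller than minHeight (e.g. half the image height). */
    static List<int[]> tallBoxes(boolean[][] fg, double minHeight) {
        List<int[]> tall = new ArrayList<>();
        for (int[] b : componentBoxes(fg)) {
            int height = b[3] - b[1] + 1;
            if (height > minHeight) tall.add(b);
        }
        return tall;
    }
}
```

Calling tallBoxes(grid, grid.length * 0.5) mirrors the height threshold of half the image height used in the code further down; small specks and short fragments fall below it and are discarded.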

EDIT

  • Prepare a mask using the convex hull of the found digit contours (image: mask)

  • Copy the digits region to a clean image using the mask

stroke-width = 8 (image: cl1)

stroke-width = 10 (image: cl2)
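For readers who want to see the masking idea without OpenCV, here is a rough pure-Java sketch: it computes a convex hull with Andrew's monotone chain (my choice for the illustration, not necessarily what OpenCV's convexHull uses internally) and copies only the pixels that fall inside the hull. All names are hypothetical, and points are {x, y} arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HullMaskSketch {
    /** Cross product of (a - o) and (b - o); positive for a left turn. */
    static long cross(int[] o, int[] a, int[] b) {
        return (long) (a[0] - o[0]) * (b[1] - o[1]) - (long) (a[1] - o[1]) * (b[0] - o[0]);
    }

    /** Andrew's monotone chain; hull vertices come back in a consistent orientation. */
    static List<int[]> convexHull(List<int[]> points) {
        List<int[]> pts = new ArrayList<>(points);
        pts.sort((p, q) -> p[0] != q[0] ? Integer.compare(p[0], q[0]) : Integer.compare(p[1], q[1]));
        if (pts.size() < 3) return pts;
        int[][] hull = new int[2 * pts.size()][];
        int k = 0;
        for (int[] p : pts) { // lower chain
            while (k >= 2 && cross(hull[k - 2], hull[k - 1], p) <= 0) k--;
            hull[k++] = p;
        }
        int lowerSize = k + 1;
        for (int i = pts.size() - 2; i >= 0; i--) { // upper chain
            int[] p = pts.get(i);
            while (k >= lowerSize && cross(hull[k - 2], hull[k - 1], p) <= 0) k--;
            hull[k++] = p;
        }
        // the last vertex repeats the first one, so drop it
        return new ArrayList<>(Arrays.asList(hull).subList(0, k - 1));
    }

    /** True if (x, y) lies inside or on the border of a hull with at least 3 vertices. */
    static boolean inside(List<int[]> hull, int x, int y) {
        if (hull.size() < 3) return false;
        int[] p = {x, y};
        for (int i = 0; i < hull.size(); i++) {
            if (cross(hull.get(i), hull.get((i + 1) % hull.size()), p) < 0) return false;
        }
        return true;
    }

    /** Copies src pixels inside the hull into an all-zero image of the same size. */
    static int[][] copyWithMask(int[][] src, List<int[]> hull) {
        int[][] cleaned = new int[src.length][src[0].length];
        for (int y = 0; y < src.length; y++) {
            for (int x = 0; x < src[0].length; x++) {
                if (inside(hull, x, y)) cleaned[y][x] = src[y][x];
            }
        }
        return cleaned;
    }
}
```

Because a single hull is taken over all digit contours together, anything lying between the digits is copied as well. That is fine when the digits sit close together on one line, which is the situation assumed here.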

My Tesseract knowledge is a bit rusty, but as I remember you can get a confidence level for each character. If you still happen to detect noisy regions as character bounding boxes, you may be able to filter them out using this information.
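A minimal sketch of that filtering idea, assuming you have already pulled each recognized character and its confidence out of Tesseract by whatever means your binding offers. The Symbol class, the 0 to 100 confidence scale, and the cut-off of 60 used below are assumptions for illustration, not part of any Tesseract API.

```java
import java.util.List;

final class ConfidenceFilterSketch {
    /** Hypothetical holder for one recognized character and its confidence. */
    static final class Symbol {
        final char ch;
        final float confidence;

        Symbol(char ch, float confidence) {
            this.ch = ch;
            this.confidence = confidence;
        }
    }

    /** Concatenates the characters whose confidence is at least minConfidence. */
    static String keepConfident(List<Symbol> symbols, float minConfidence) {
        StringBuilder sb = new StringBuilder();
        for (Symbol s : symbols) {
            if (s.confidence >= minConfidence) sb.append(s.ch);
        }
        return sb.toString();
    }
}
```

For example, given the symbols ('1', 95), ('~', 20) and ('7', 88), keepConfident(symbols, 60f) returns "17". The right cut-off depends on your images, so treat it as a value to tune.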

C++ Code

    Mat im = imread("aRh8C.png", 0);
    // apply Otsu threshold
    Mat bw;
    threshold(im, bw, 0, 255, CV_THRESH_BINARY_INV | CV_THRESH_OTSU);
    // take the distance transform
    Mat dist;
    distanceTransform(bw, dist, CV_DIST_L2, CV_DIST_MASK_PRECISE);
    Mat dibw;
    // threshold the distance transformed image
    double SWTHRESH = 8;    // stroke width threshold
    threshold(dist, dibw, SWTHRESH/2, 255, CV_THRESH_BINARY);
    Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
    // perform opening, in case digits are still connected
    Mat morph;
    morphologyEx(dibw, morph, CV_MOP_OPEN, kernel);
    dibw.convertTo(dibw, CV_8U);
    // find contours and filter
    Mat cont;
    morph.convertTo(cont, CV_8U);

    Mat binary;
    cvtColor(dibw, binary, CV_GRAY2BGR);

    const double HTHRESH = im.rows * .5;    // height threshold
    vector<vector<Point>> contours;
    vector<Vec4i> hierarchy;
    vector<Point> digits; // points corresponding to digit contours

    findContours(cont, contours, hierarchy, CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
    for(int idx = 0; idx >= 0; idx = hierarchy[idx][0])
    {
        Rect rect = boundingRect(contours[idx]);
        if (rect.height > HTHRESH)
        {
            // append the points of this contour to digit points
            digits.insert(digits.end(), contours[idx].begin(), contours[idx].end());

            rectangle(binary,
                Point(rect.x, rect.y), Point(rect.x + rect.width - 1, rect.y + rect.height - 1),
                Scalar(0, 0, 255), 1);
        }
    }

    // take the convexhull of the digit contours
    vector<Point> digitsHull;
    convexHull(digits, digitsHull);
    // prepare a mask
    vector<vector<Point>> digitsRegion;
    digitsRegion.push_back(digitsHull);
    Mat digitsMask = Mat::zeros(im.rows, im.cols, CV_8U);
    drawContours(digitsMask, digitsRegion, 0, Scalar(255, 255, 255), -1);
    // expand the mask to include any information we lost in earlier morphological opening
    morphologyEx(digitsMask, digitsMask, CV_MOP_DILATE, kernel);
    // copy the region to get a cleaned image
    Mat cleaned = Mat::zeros(im.rows, im.cols, CV_8U);
    dibw.copyTo(cleaned, digitsMask);

EDIT

Java Code

    Mat im = Highgui.imread("aRh8C.png", 0);
    // apply Otsu threshold
    Mat bw = new Mat(im.size(), CvType.CV_8U);
    Imgproc.threshold(im, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
    // take the distance transform
    Mat dist = new Mat(im.size(), CvType.CV_32F);
    Imgproc.distanceTransform(bw, dist, Imgproc.CV_DIST_L2, Imgproc.CV_DIST_MASK_PRECISE);
    // threshold the distance transform
    Mat dibw32f = new Mat(im.size(), CvType.CV_32F);
    final double SWTHRESH = 8.0;    // stroke width threshold
    Imgproc.threshold(dist, dibw32f, SWTHRESH/2.0, 255, Imgproc.THRESH_BINARY);
    Mat dibw8u = new Mat(im.size(), CvType.CV_8U);
    dibw32f.convertTo(dibw8u, CvType.CV_8U);

    Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    // open to remove connections to stray elements
    Mat cont = new Mat(im.size(), CvType.CV_8U);
    Imgproc.morphologyEx(dibw8u, cont, Imgproc.MORPH_OPEN, kernel);
    // find contours and filter based on bounding-box height
    final double HTHRESH = im.rows() * 0.5; // bounding-box height threshold
    List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
    List<Point> digits = new ArrayList<Point>();    // contours of the possible digits
    Imgproc.findContours(cont, contours, new Mat(), Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_SIMPLE);
    for (int i = 0; i < contours.size(); i++)
    {
        if (Imgproc.boundingRect(contours.get(i)).height > HTHRESH)
        {
            // this contour passed the bounding-box height threshold. add it to digits
            digits.addAll(contours.get(i).toList());
        }
    }
    // find the convexhull of the digit contours
    MatOfInt digitsHullIdx = new MatOfInt();
    MatOfPoint hullPoints = new MatOfPoint();
    hullPoints.fromList(digits);
    Imgproc.convexHull(hullPoints, digitsHullIdx);
    // convert hull index to hull points
    List<Point> digitsHullPointsList = new ArrayList<Point>();
    List<Point> points = hullPoints.toList();
    for (Integer i: digitsHullIdx.toList())
    {
        digitsHullPointsList.add(points.get(i));
    }
    MatOfPoint digitsHullPoints = new MatOfPoint();
    digitsHullPoints.fromList(digitsHullPointsList);
    // create the mask for digits
    List<MatOfPoint> digitRegions = new ArrayList<MatOfPoint>();
    digitRegions.add(digitsHullPoints);
    Mat digitsMask = Mat.zeros(im.size(), CvType.CV_8U);
    Imgproc.drawContours(digitsMask, digitRegions, 0, new Scalar(255, 255, 255), -1);
    // dilate the mask to capture any info we lost in earlier opening
    Imgproc.morphologyEx(digitsMask, digitsMask, Imgproc.MORPH_DILATE, kernel);
    // cleaned image ready for OCR
    Mat cleaned = Mat.zeros(im.size(), CvType.CV_8U);
    dibw8u.copyTo(cleaned, digitsMask);
    // feed cleaned to Tesseract
answered Oct 10 '22 09:10 by dhanushka