I've written an application that segments an image based on the text regions within it, and extracts those regions as I see fit. What I'm attempting to do is clean the image so OCR (Tesseract) gives an accurate result. I have the following image as an example:

Running this through Tesseract gives a wildly inaccurate result. However, cleaning up the image (using Photoshop) so that it looks as follows:

Gives exactly the result I would expect. The first image is already being run through the following method to clean it to that point:
    public Mat cleanImage(Mat srcImage) {
        Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
        Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_OTSU);
        Imgproc.erode(srcImage, srcImage, new Mat());
        Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(0, 0), 9);
        return srcImage;
    }

What more can I do to clean the first image so it resembles the second image?
Edit: This is the original image before it's run through the cleanImage function.

Noise removal is one of the preprocessing steps for OCR (optical character recognition). Noise reduces the accuracy of the later stages of an OCR system. It can appear in the foreground or background of an image and can be introduced before or after scanning.
After converting the image to greyscale, try applying histogram equalization, which raises the contrast of the low-contrast areas in the image. Then blur the image to reduce the noise in the background.
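To make the equalize-then-blur suggestion concrete, here is a minimal sketch in plain Java, with no OpenCV dependency so that it stays self-contained. The class and method names are made up for illustration, the image is assumed to be an `int[][]` of grey levels in 0..255, and a simple 3x3 box blur stands in for whatever blur you prefer. With OpenCV you would normally reach for `Imgproc.equalizeHist` and one of the blur functions instead.

```java
/** Pure-Java sketch: histogram equalization and a 3x3 box blur on an 8-bit greyscale image. */
final class ContrastAndBlurSketch {

    /** Remaps grey levels (0..255) through the normalized cumulative histogram. */
    static int[][] equalizeHistogram(int[][] gray) {
        int rows = gray.length, cols = gray[0].length, total = rows * cols;
        int[] hist = new int[256];
        for (int[] row : gray) {
            for (int v : row) hist[v]++;
        }
        // cumulative histogram; cdfMin is the count at the darkest level present
        int[] cdf = new int[256];
        int running = 0, cdfMin = 0;
        for (int i = 0; i < 256; i++) {
            running += hist[i];
            cdf[i] = running;
            if (cdfMin == 0) cdfMin = running;
        }
        int[][] out = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                // a single-level image has nothing to stretch, so copy it unchanged
                out[r][c] = (total == cdfMin)
                        ? gray[r][c]
                        : Math.round(255f * (cdf[gray[r][c]] - cdfMin) / (total - cdfMin));
            }
        }
        return out;
    }

    /** Replaces each pixel with the mean of its in-bounds 3x3 neighbourhood. */
    static int[][] boxBlur3x3(int[][] gray) {
        int rows = gray.length, cols = gray[0].length;
        int[][] out = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int sum = 0, count = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int rr = r + dr, cc = c + dc;
                        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                            sum += gray[rr][cc];
                            count++;
                        }
                    }
                }
                out[r][c] = sum / count;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] img = {{50, 50}, {100, 150}};
        int[][] cleaned = boxBlur3x3(equalizeHistogram(img));
        System.out.println(java.util.Arrays.deepToString(cleaned));
    }
}
```

Equalizing first and blurring second follows the order given above. Whether it actually helps depends on the image, so treat it as something to try rather than a guaranteed fix.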
My answer is based on the following assumptions. It's possible that none of them holds in your case.
This is my procedure for extracting the digits:
Apply Otsu threshold
Take the distance transform
Threshold the distance-transformed image using the stroke-width ( = 8) constraint
Apply a morphological opening to disconnect the digits from stray elements
Filter bounding box heights and make a guess where the digits are
stroke-width = 8
stroke-width = 10
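Why threshold the distance transform at half the stroke width? The distance transform gives, for every foreground pixel, the distance to the nearest background pixel, so the largest value inside a stroke is roughly half its width. Keeping only pixels above `strokeWidth / 2` therefore leaves the cores of strokes that are thicker than the chosen width and drops thinner noise. Below is a rough sketch in plain Java (no OpenCV) that shows the idea on a tiny boolean image. The class and method names are hypothetical, and the brute-force search is only meant for small examples.

```java
/** Pure-Java sketch of the stroke-width filter: distance transform, then threshold at width / 2. */
final class StrokeWidthFilterSketch {

    /** Euclidean distance from each foreground pixel to the nearest background pixel (brute force). */
    static double[][] distanceTransform(boolean[][] fg) {
        int rows = fg.length, cols = fg[0].length;
        double[][] dist = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!fg[r][c]) continue;            // background pixels keep distance 0
                double best = Double.MAX_VALUE;     // stays huge if the image has no background at all
                for (int rr = 0; rr < rows; rr++) {
                    for (int cc = 0; cc < cols; cc++) {
                        if (!fg[rr][cc]) {
                            best = Math.min(best, Math.hypot(r - rr, c - cc));
                        }
                    }
                }
                dist[r][c] = best;
            }
        }
        return dist;
    }

    /** Keeps only pixels lying deeper than strokeWidth / 2 inside the foreground. */
    static boolean[][] keepThickStrokes(boolean[][] fg, double strokeWidth) {
        double[][] dist = distanceTransform(fg);
        boolean[][] out = new boolean[fg.length][fg[0].length];
        for (int r = 0; r < fg.length; r++) {
            for (int c = 0; c < fg[0].length; c++) {
                out[r][c] = dist[r][c] > strokeWidth / 2.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // a thick horizontal bar (rows 1..5) and a one-pixel line of noise (row 7)
        boolean[][] fg = new boolean[10][12];
        for (int r = 1; r <= 5; r++) java.util.Arrays.fill(fg[r], true);
        java.util.Arrays.fill(fg[7], true);
        boolean[][] kept = keepThickStrokes(fg, 4.0);
        for (boolean[] row : kept) {
            StringBuilder sb = new StringBuilder();
            for (boolean b : row) sb.append(b ? '#' : '.');
            System.out.println(sb);
        }
    }
}
```

Note that the thresholded strokes come out thinner than the originals, because only their cores survive. That is also why the stroke-width value needs tuning per image.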
EDIT
Prepare a mask using the convex hull of the found digit contours
Copy digits region to a clean image using the mask
stroke-width = 8 
stroke-width = 10 
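The two masking steps can be illustrated without OpenCV as well. The sketch below is only an assumption-laden stand-in for `convexHull` plus `drawContours` plus `copyTo`: it builds the hull with the monotone chain algorithm, treats every pixel inside or on the hull as part of the mask, and copies those pixels into an otherwise black image. All names are hypothetical, points are `{x, y}` arrays, and the hull is assumed to have at least three non-collinear points.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Pure-Java sketch: convex hull of the digit points, used as a mask for copying pixels. */
final class HullMaskSketch {

    /** Cross product of (a - o) and (b - o); its sign tells which way the turn o -> a -> b goes. */
    static long cross(int[] o, int[] a, int[] b) {
        return (long) (a[0] - o[0]) * (b[1] - o[1]) - (long) (a[1] - o[1]) * (b[0] - o[0]);
    }

    /** Monotone chain convex hull; points are {x, y} pairs. */
    static List<int[]> convexHull(List<int[]> points) {
        int[][] p = points.toArray(new int[0][]);
        Arrays.sort(p, (a, b) -> a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]));
        int n = p.length;
        if (n < 3) return new ArrayList<>(Arrays.asList(p));
        int[][] hull = new int[2 * n][];
        int k = 0;
        for (int i = 0; i < n; i++) {                       // first half of the hull
            while (k >= 2 && cross(hull[k - 2], hull[k - 1], p[i]) <= 0) k--;
            hull[k++] = p[i];
        }
        for (int i = n - 2, t = k + 1; i >= 0; i--) {       // second half, walking back
            while (k >= t && cross(hull[k - 2], hull[k - 1], p[i]) <= 0) k--;
            hull[k++] = p[i];
        }
        return new ArrayList<>(Arrays.asList(hull).subList(0, k - 1)); // last point repeats the first
    }

    /** True if (x, y) lies inside or on the hull. */
    static boolean inside(List<int[]> hull, int x, int y) {
        if (hull.size() < 3) return false;                  // degenerate hull: no usable mask
        int[] q = {x, y};
        for (int i = 0; i < hull.size(); i++) {
            if (cross(hull.get(i), hull.get((i + 1) % hull.size()), q) < 0) return false;
        }
        return true;
    }

    /** Copies src pixels that fall inside the hull; everything else stays 0. */
    static int[][] copyThroughMask(int[][] src, List<int[]> hull) {
        int[][] out = new int[src.length][src[0].length];
        for (int y = 0; y < src.length; y++) {
            for (int x = 0; x < src[0].length; x++) {
                if (inside(hull, x, y)) out[y][x] = src[y][x];
            }
        }
        return out;
    }
}
```

The C++ and Java code below additionally dilates the mask slightly before copying, which this sketch leaves out.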
My Tesseract knowledge is a bit rusty. As I remember, you can get a confidence level for each character. You may be able to use it to filter out noise if noisy regions still end up detected as character bounding boxes.
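As a rough illustration of that idea, the sketch below drops characters whose confidence falls under a cutoff. It deliberately does not call Tesseract: `RecognizedChar` is a hypothetical holder that you would fill from whatever per-character results your Tesseract wrapper gives you, and the 0 to 100 confidence scale and the cutoff of 60 are assumptions to adjust.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: filter OCR output by per-character confidence. Not tied to any Tesseract API. */
final class ConfidenceFilterSketch {

    /** Hypothetical holder for one recognized character and its confidence. */
    static final class RecognizedChar {
        final char symbol;
        final float confidence;   // assumed to be on a 0..100 scale

        RecognizedChar(char symbol, float confidence) {
            this.symbol = symbol;
            this.confidence = confidence;
        }
    }

    /** Joins the characters whose confidence is at least minConfidence. */
    static String keepConfident(List<RecognizedChar> chars, float minConfidence) {
        StringBuilder sb = new StringBuilder();
        for (RecognizedChar rc : chars) {
            if (rc.confidence >= minConfidence) {
                sb.append(rc.symbol);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<RecognizedChar> chars = new ArrayList<>();
        chars.add(new RecognizedChar('1', 95f));
        chars.add(new RecognizedChar('~', 20f));   // likely a noise blob read as a character
        chars.add(new RecognizedChar('7', 88f));
        System.out.println(keepConfident(chars, 60f));
    }
}
```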
C++ Code
    // requires #include <opencv2/opencv.hpp> and <vector>; assumes using namespace cv and std
    Mat im = imread("aRh8C.png", 0);    // load as greyscale
    // apply Otsu threshold
    Mat bw;
    threshold(im, bw, 0, 255, CV_THRESH_BINARY_INV | CV_THRESH_OTSU);
    // take the distance transform
    Mat dist;
    distanceTransform(bw, dist, CV_DIST_L2, CV_DIST_MASK_PRECISE);
    Mat dibw;
    // threshold the distance transformed image
    double SWTHRESH = 8;    // stroke width threshold
    threshold(dist, dibw, SWTHRESH/2, 255, CV_THRESH_BINARY);
    Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
    // perform opening, in case digits are still connected
    Mat morph;
    morphologyEx(dibw, morph, CV_MOP_OPEN, kernel);
    dibw.convertTo(dibw, CV_8U);
    // find contours and filter
    Mat cont;
    morph.convertTo(cont, CV_8U);

    // colour copy, used only for drawing the accepted bounding boxes
    Mat binary;
    cvtColor(dibw, binary, CV_GRAY2BGR);

    const double HTHRESH = im.rows * .5;    // height threshold
    vector<vector<Point>> contours;
    vector<Vec4i> hierarchy;
    vector<Point> digits; // points corresponding to digit contours

    findContours(cont, contours, hierarchy, CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
    // walk the top-level contours; the loop is skipped if none were found
    for (int idx = 0; idx >= 0 && !contours.empty(); idx = hierarchy[idx][0])
    {
        Rect rect = boundingRect(contours[idx]);
        if (rect.height > HTHRESH)
        {
            // append the points of this contour to digit points
            digits.insert(digits.end(), contours[idx].begin(), contours[idx].end());

            rectangle(binary,
                Point(rect.x, rect.y), Point(rect.x + rect.width - 1, rect.y + rect.height - 1),
                Scalar(0, 0, 255), 1);
        }
    }

    // take the convexhull of the digit contours
    vector<Point> digitsHull;
    convexHull(digits, digitsHull);
    // prepare a mask
    vector<vector<Point>> digitsRegion;
    digitsRegion.push_back(digitsHull);
    Mat digitsMask = Mat::zeros(im.rows, im.cols, CV_8U);
    drawContours(digitsMask, digitsRegion, 0, Scalar(255, 255, 255), -1);
    // expand the mask to include any information we lost in earlier morphological opening
    morphologyEx(digitsMask, digitsMask, CV_MOP_DILATE, kernel);
    // copy the region to get a cleaned image
    Mat cleaned = Mat::zeros(im.rows, im.cols, CV_8U);
    dibw.copyTo(cleaned, digitsMask);

EDIT
Java Code
    // imports: org.opencv.core.*, org.opencv.highgui.Highgui, org.opencv.imgproc.Imgproc,
    //          java.util.ArrayList, java.util.List
    Mat im = Highgui.imread("aRh8C.png", 0);    // load as greyscale
    // apply Otsu threshold
    Mat bw = new Mat(im.size(), CvType.CV_8U);
    Imgproc.threshold(im, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
    // take the distance transform
    Mat dist = new Mat(im.size(), CvType.CV_32F);
    Imgproc.distanceTransform(bw, dist, Imgproc.CV_DIST_L2, Imgproc.CV_DIST_MASK_PRECISE);
    // threshold the distance transform
    Mat dibw32f = new Mat(im.size(), CvType.CV_32F);
    final double SWTHRESH = 8.0;    // stroke width threshold
    Imgproc.threshold(dist, dibw32f, SWTHRESH/2.0, 255, Imgproc.THRESH_BINARY);
    Mat dibw8u = new Mat(im.size(), CvType.CV_8U);
    dibw32f.convertTo(dibw8u, CvType.CV_8U);

    Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    // open to remove connections to stray elements
    Mat cont = new Mat(im.size(), CvType.CV_8U);
    Imgproc.morphologyEx(dibw8u, cont, Imgproc.MORPH_OPEN, kernel);
    // find contours and filter based on bounding-box height
    final double HTHRESH = im.rows() * 0.5; // bounding-box height threshold
    List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
    List<Point> digits = new ArrayList<Point>();    // contours of the possible digits
    Imgproc.findContours(cont, contours, new Mat(), Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_SIMPLE);
    for (int i = 0; i < contours.size(); i++)
    {
        if (Imgproc.boundingRect(contours.get(i)).height > HTHRESH)
        {
            // this contour passed the bounding-box height threshold. add it to digits
            digits.addAll(contours.get(i).toList());
        }
    }
    // find the convexhull of the digit contours
    MatOfInt digitsHullIdx = new MatOfInt();
    MatOfPoint hullPoints = new MatOfPoint();
    hullPoints.fromList(digits);
    Imgproc.convexHull(hullPoints, digitsHullIdx);
    // convert hull index to hull points
    List<Point> digitsHullPointsList = new ArrayList<Point>();
    List<Point> points = hullPoints.toList();
    for (Integer i: digitsHullIdx.toList())
    {
        digitsHullPointsList.add(points.get(i));
    }
    MatOfPoint digitsHullPoints = new MatOfPoint();
    digitsHullPoints.fromList(digitsHullPointsList);
    // create the mask for digits
    List<MatOfPoint> digitRegions = new ArrayList<MatOfPoint>();
    digitRegions.add(digitsHullPoints);
    Mat digitsMask = Mat.zeros(im.size(), CvType.CV_8U);
    Imgproc.drawContours(digitsMask, digitRegions, 0, new Scalar(255, 255, 255), -1);
    // dilate the mask to capture any info we lost in earlier opening
    Imgproc.morphologyEx(digitsMask, digitsMask, Imgproc.MORPH_DILATE, kernel);
    // cleaned image ready for OCR
    Mat cleaned = Mat.zeros(im.size(), CvType.CV_8U);
    dibw8u.copyTo(cleaned, digitsMask);
    // feed cleaned to Tesseract