
Clean text images with OpenCV for OCR reading

I received some images that need to be cleaned up before I can extract information from them with OCR. Here are the originals:

original 1

original 2

original 3

original 4

After processing them with this code:

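import cv2

# Load as grayscale (flag 0), binarize with a fixed threshold, then open with a 2x2 kernel
img = cv2.imread('original_1.jpg', 0)
ret, thresh = cv2.threshold(img, 55, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
cv2.imwrite('result_1.jpg', opening)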

I get these results:

result 1

result 2

result 3

result 4

As you can see, some images give nice results for OCR reading, while others still have some noise in the background.

Any suggestions on how to clean up the background?

asked Jan 22 '20 16:01 by SteelMasimo



2 Answers

MH304's answer is very nice and straightforward. In case you can't use morphology or blurring to get a cleaner image, consider using an "Area Filter". That is, filter out every blob that does not reach a minimum area.

Use OpenCV's connectedComponentsWithStats. Here's a C++ implementation of a very basic area filter:

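#include <cstdlib>
#include <vector>
#include <opencv2/opencv.hpp>

// Snippet: place inside a function.
// bwImage: 8-bit single-channel binary image (white blobs on black), defined elsewhere.
cv::Mat outputLabels, stats, centroids;
int connectivity = 8; // assumption: not shown in the original snippet (OpenCV accepts 4 or 8)

int numberofComponents = cv::connectedComponentsWithStats(bwImage, outputLabels,
stats, centroids, connectivity);

//one random BGR color per label (labels run from 0 to numberofComponents-1):
std::vector<cv::Vec3b> colors(numberofComponents);
for( int i = 0; i < numberofComponents; i++ ) {
    colors[i] = cv::Vec3b(rand()%256, rand()%256, rand()%256);
}

//do not count the original background-> label = 0:
colors[0] = cv::Vec3b(0,0,0);

//Area threshold:
int minArea = 10; //10 px

for( int i = 1; i < numberofComponents; i++ ) {

    //get the area of the current blob:
    auto blobArea = stats.at<int>(i, cv::CC_STAT_AREA);

    //apply the area filter:
    if ( blobArea < minArea )
    {
        //filter blob below minimum area:
        //small regions are painted with (ridiculous) pink color
        colors[i] = cv::Vec3b(248,48,213);

    }

}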

Using the area filter I get this result on your noisiest image:

enter image description here

Additional info:

Basically, the algorithm goes like this:

  • Pass a binary image to connectedComponentsWithStats. The function will compute the number of connected components, matrix of labels and an additional matrix with statistics – including blob area.

  • Prepare a color vector of size “numberofComponents”; this helps visualize the blobs that we are actually filtering. The colors are generated randomly by the rand function: 3 values in the range 0 – 255 for each blob, in BGR order.

  • Consider that the background is colored in black, so ignore this “connected component” and its color (black).

  • Set an area threshold. All blobs or pixels below this area will be colored with a (ridiculous) pink.

  • Loop through all the connected components found (blobs), retrieve the area of the current blob via the stats matrix, and compare it to the area threshold.

  • If the area is below the threshold, color the blob pink (in this case, but usually you want black).

answered Sep 30 '22 07:09 by stateMachine


This is a fully coded Python solution based on the direction provided by @eldesgraciado.

This code assumes that you are already working with a properly binarized white-on-black image (e.g. after grayscale conversion, black-hat morphology and Otsu's thresholding). The OpenCV documentation recommends working with binarized images with a white foreground when applying morphological operations and similar processing.

import cv2
import numpy as np

# thresh_image: binarized white-on-black uint8 image (assumed to exist already)
num_comps, labeled_pixels, comp_stats, comp_centroids = \
    cv2.connectedComponentsWithStats(thresh_image, connectivity=4)
min_comp_area = 10  # pixels
# get the indices/labels of the remaining components based on the area stat
# (skip the background component at index 0)
remaining_comp_labels = [i for i in range(1, num_comps)
                         if comp_stats[i, cv2.CC_STAT_AREA] >= min_comp_area]
# filter the labeled pixels based on the remaining labels,
# assign pixel intensity 255 (uint8) to the remaining pixels and 0 to the rest
clean_img = np.where(np.isin(labeled_pixels, remaining_comp_labels), 255, 0).astype('uint8')

The advantage of this solution is that it allows you to filter out the noise without negatively affecting the characters that may already be compromised.

I work with dirty scans that have undesirable effects like merged characters and character erosion, and I found out the hard way that there is no free lunch: even a seemingly harmless opening operation with a 3x3 kernel and one iteration results in some character degradation (despite being very effective at removing the noise around the characters).

So if the character quality allows, blunt cleanup operations on the entire image (e.g. blurring, opening, closing) are OK; if not, the area filter above should be applied first.

P.S. One more thing: do not use a lossy format like JPEG when working with text images; use a lossless format like PNG instead.

answered Sep 30 '22 06:09 by Gene M