Remove background color in image processing for OCR

I am trying to remove the background color from images in order to improve OCR accuracy. A sample image looks like this:

enter image description here

I'd like to keep all the letters in the post-processed image and remove only the light purple textured background. Is it possible to use open source software such as ImageMagick to convert the image to a binary (black/white) image to achieve this? What if the background has more than one color? Would the solution be the same?

Further, what if I also want to remove the purple letters (the theater name) and the line, so that only the black letters are kept? Simple cropping might not work because the purple letters could appear in other places as well.

I am looking for a programmatic solution, rather than one that relies on tools like Photoshop.

charles asked Apr 01 '11 00:04


4 Answers

You can do this using GIMP (or any other image editing tool).

  1. Open your image
  2. Convert to grayscale
  3. Duplicate the layer
  4. Apply Gaussian blur using a large kernel (10x10) to the top layer
  5. Calculate the image difference between the top and bottom layer
  6. Threshold the image to yield a binary image

Blurred image:

enter image description here

Difference image:

enter image description here

Binary:

enter image description here

If you're doing it as a one-off, GIMP is probably good enough. If you expect to do this many times over, you could write an ImageMagick script or code up the approach using something like Python and OpenCV.
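If you go the scripting route, the six steps above map onto a few lines of code. What follows is a minimal, dependency-free Python sketch under stated assumptions, not the exact GIMP pipeline: it substitutes a simple box blur for the Gaussian blur, it keeps only pixels that are darker than their blurred surroundings (so it assumes dark text on a lighter background), and the names `to_gray`, `box_blur` and `binarize_by_blur_difference`, as well as the `radius` and `threshold` values, are made up for illustration. With OpenCV you would probably reach for `cv2.GaussianBlur` and `cv2.threshold` instead.

```python
def to_gray(rgb):
    """Turn rows of (r, g, b) tuples into luminance values in 0..255."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]


def box_blur(img, radius):
    """Mean over a (2*radius+1)-wide square window, clipped at the image borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        ys = range(max(0, y - radius), min(h, y + radius + 1))
        row = []
        for x in range(w):
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            row.append(total / (len(ys) * len(xs)))
        out.append(row)
    return out


def binarize_by_blur_difference(rgb, radius=5, threshold=20):
    """Return rows of 0 (text) and 255 (background)."""
    gray = to_gray(rgb)
    blurred = box_blur(gray, radius)  # stand-in for the Gaussian blur step
    return [
        # Assumption: text is darker than its local average by more than `threshold`.
        [0 if b - g > threshold else 255 for g, b in zip(gray_row, blur_row)]
        for gray_row, blur_row in zip(gray, blurred)
    ]


# Tiny synthetic example: textured light purple background with one black stroke.
image = [
    [(0, 0, 0) if x == 4 else ((200, 180, 220) if (x + y) % 2 else (190, 170, 210))
     for x in range(9)]
    for y in range(9)
]
binary = binarize_by_blur_difference(image, radius=2, threshold=20)
print(binary[0])
```

The blur radius should be clearly larger than the stroke width of the text; otherwise the blurred layer still contains the letters and the difference becomes too small to threshold.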

Some problems with the above approach:

  • The purple text (CENTURY) gets lost because it has less contrast than the other text. You could work around this by thresholding different parts of the image differently, or by using local histogram manipulation methods.

mpenkov answered Nov 15 '22 07:11


The following shows a possible strategy for processing your image and then running OCR on it.

The last step is the OCR itself. My OCR routine is VERY basic, so I'm sure you can get better results.

The code is written in Mathematica.

enter image description here

Not bad at all!

Dr. belisarius answered Nov 15 '22 06:11


You can blur the image to get an almost clean estimate of the background. Then divide each color component of each pixel of the original image by the corresponding component of the same pixel in the blurred background. This gives you text on a white background. Additional post-processing can help further.

This method works when the text is darker than the background (in each color component). Otherwise, you can invert the colors first and then apply the method.
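A rough Python sketch of this blur-and-divide idea is shown below. It is only an illustration under stated assumptions: the blur is a plain box blur, the helper names `box_blur` and `divide_by_background` and the `radius` default are hypothetical, and the ratio is scaled so that "equal to the background" maps to 255. In OpenCV, `cv2.divide` with a scale factor could play the same role.

```python
def box_blur(plane, radius):
    """Mean over a (2*radius+1)-wide square window, clipped at the image borders."""
    h, w = len(plane), len(plane[0])
    out = []
    for y in range(h):
        ys = range(max(0, y - radius), min(h, y + radius + 1))
        row = []
        for x in range(w):
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(plane[j][i] for j in ys for i in xs)
            row.append(total / (len(ys) * len(xs)))
        out.append(row)
    return out


def divide_by_background(rgb, radius=5):
    """Divide each color component by its blurred (background) value."""
    h, w = len(rgb), len(rgb[0])
    # Split rows of (r, g, b) tuples into three single-channel planes.
    planes = [[[px[c] for px in row] for row in rgb] for c in range(3)]
    backgrounds = [box_blur(plane, radius) for plane in planes]
    return [
        [
            tuple(
                # A ratio of 1 means "same as background" and maps to white;
                # max(..., 1) guards against division by zero.
                min(255, round(255 * planes[c][y][x] / max(backgrounds[c][y][x], 1)))
                for c in range(3)
            )
            for x in range(w)
        ]
        for y in range(h)
    ]
```

Pixels right next to a letter come out clipped to white here, because the blurred estimate at those positions is pulled darker by the letter itself. A larger blur radius reduces that effect on real images.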

fdermishin answered Nov 15 '22 08:11


In ImageMagick, you can use the -lat (local adaptive threshold) option to do that:

convert image.jpg -colorspace gray -negate -lat 50x50+5% -negate result.jpg

enter image description here
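To make the first command less of a black box, here is a small Python sketch of what the negate, -lat, negate sequence does as I understand it: each pixel is compared with the mean of its neighbourhood plus an offset, and becomes white if it is brighter than that and black otherwise. This is an approximation for illustration, not ImageMagick's implementation; `box_mean`, `local_adaptive_threshold`, `radius` and `offset_percent` are hypothetical names, and window handling at the image borders may differ from the real tool.

```python
def box_mean(img, radius):
    """Local mean over a (2*radius+1)-wide square window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        ys = range(max(0, y - radius), min(h, y + radius + 1))
        row = []
        for x in range(w):
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            row.append(total / (len(ys) * len(xs)))
        out.append(row)
    return out


def local_adaptive_threshold(gray, radius=25, offset_percent=5):
    """Binarize a grayscale image (rows of 0..255 values): 0 = text, 255 = background."""
    offset = 255 * offset_percent / 100
    # First negate: dark text becomes the bright foreground.
    negated = [[255 - v for v in row] for row in gray]
    means = box_mean(negated, radius)
    # Local threshold: white where the pixel exceeds its local mean plus the offset.
    mask = [
        [255 if v > m + offset else 0 for v, m in zip(value_row, mean_row)]
        for value_row, mean_row in zip(negated, means)
    ]
    # Second negate: back to black text on a white background.
    return [[255 - v for v in row] for row in mask]
```

Because the comparison is against a local mean rather than one global value, a background that changes color or brightness across the image is handled in the same way as a uniform one.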

convert image.jpg -colorspace HSB -channel 2 -separate +channel \
-white-threshold 35% \
-negate -lat 50x50+5% -negate \
-morphology erode octagon:1 result2.jpg

enter image description here
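The second command targets the case where only the black letters should survive. As I read it, it takes the brightness channel of the HSB representation and pushes everything brighter than 35% to white before the local threshold runs, so colored text such as the purple letters disappears along with the background. A minimal Python sketch of just that first stage follows, assuming brightness is the maximum of the three RGB components; the function name `brightness_white_threshold` and its parameter are made up for illustration.

```python
def brightness_white_threshold(rgb, cutoff_percent=35):
    """Return the brightness channel with every value above the cutoff set to white."""
    cutoff = 255 * cutoff_percent / 100
    out = []
    for row in rgb:
        out_row = []
        for r, g, b in row:
            value = max(r, g, b)  # HSB/HSV brightness of the pixel
            # Bright pixels (background, colored text) become white;
            # dark pixels (black text) keep their original value.
            out_row.append(255 if value > cutoff else value)
        out.append(out_row)
    return out


# One near-black pixel, one purple pixel, one light purple background pixel.
sample = [[(10, 10, 10), (120, 60, 160), (200, 180, 220)]]
print(brightness_white_threshold(sample))  # [[10, 255, 255]]
```

The 35% cutoff is specific to this sample; darker colored text or lighter black text would need a different value.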

fmw42 answered Nov 15 '22 07:11