
Cleaning up an image for OCR with ImageMagick and 'textcleaner'

I have the following image that I'd like to prepare for OCR with Tesseract: [image]

The objective is to clean up the image and remove all of the noise. I'm using the textcleaner script that uses ImageMagick with the following parameters:

./textcleaner -g -e normalize -f 30 -o 12 -s 2 original.jpg output.jpg

The output is still not very clean: [image]

I tried all kinds of variations for the parameters but with no luck. Does anyone have an idea?

Edi asked May 14 '15 20:05


1 Answer

If you convert to JPEG, you will always have the type of artifacts you are seeing.

This is a typical "feature" of JPEG compression. JPEG is lossy, and it handles sharp lines, hard contrasts between uniformly colored areas, and images with very few colors poorly. Black-and-white text is exactly that kind of image. JPEG is only "good" for typical photos, with lots of different colors and shading...

Your problem will most likely be resolved completely if you use PNG as the output format. The following image demonstrates this. I generated it with the same parameters as your example command, but with PNG as the output format:

textcleaner -g -e normalize -f 30 -o 12 -s 2 \
    http://i.stack.imgur.com/ficx7.jpg       \
    out.png

PNG instead of JPEG output

Here is a similar zoom into the output:

Zoomed PNG

You can very likely improve the output even more if you play with the parameters of the textcleaner script. But that is your job... :-)
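As a starting point for that experimentation, here is a minimal sketch of a parameter sweep that always writes PNG. It assumes `textcleaner` is on your `PATH`. The `out_name` and `sweep` helpers and the grid of `-f` / `-o` values are hypothetical choices made for illustration, not recommended settings:

```shell
#!/bin/sh
# Hypothetical helper: build a PNG output name that records the parameters used.
out_name() {
    # $1 = input file, $2 = filter size (-f), $3 = offset (-o)
    base=$(basename "$1")
    printf '%s_f%s_o%s.png\n' "${base%.*}" "$2" "$3"
}

# Try a small grid of -f / -o values (arbitrary starting points).
# Every result is written as PNG, so no JPEG artifacts are added on output.
sweep() {
    input=$1
    for f in 15 30 45; do
        for o in 8 12 16; do
            out=$(out_name "$input" "$f" "$o")
            textcleaner -g -e normalize -f "$f" -o "$o" -s 2 "$input" "$out"
        done
    done
}

# Run only when the input file exists and textcleaner is available.
input=${1:-original.jpg}
if [ -f "$input" ] && command -v textcleaner >/dev/null 2>&1; then
    sweep "$input"
fi
```

Each output file name encodes the settings that produced it (for example `original_f30_o12.png`), so you can compare the candidates side by side or feed each one to tesseract and keep whichever reads best.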

Kurt Pfeifle answered Nov 17 '22 18:11