Image cleaning before OCR application

I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. Here are a couple of things I noticed about its accuracy:

  1. File with icons, images and text: 5-10% accurate
  2. File with only text (images and icons erased): 50-60% accurate
  3. File with stretching (and this is the best part): stretching the file from 2) above along the x or y axis increased the accuracy by 10-20%

So apparently PyTesser does not compensate for font dimensions or image stretching. Although there is a lot of theory to read about image processing and OCR, are there any standard image cleanup procedures (apart from erasing icons and images) that need to be done before applying PyTesser or other libraries, irrespective of the language?

...........

Wow, this post is quite old now. I started researching OCR again these last couple of days. This time I chucked PyTesser and used the Tesseract engine with ImageMagick instead. Coming straight to the point, this is what I found:

1) You can increase the resolution with ImageMagick (there are a bunch of simple shell commands you can use).
2) After increasing the resolution, the accuracy went up by 80-90%.
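
The post does not say which ImageMagick commands were used, so here is only a rough sketch of the upscale-then-OCR idea, assuming ImageMagick's `-resize` operator with a percentage geometry and the plain `tesseract <image> <outputbase>` call. The function names, the 300% default and the `_big` / `_ocr` file suffixes are made up for illustration. Passing different x and y percentages gives the kind of single-axis stretching mentioned in the original question.

```python
import shutil
import subprocess
from pathlib import Path


def build_resize_command(src, dst, x_percent=300, y_percent=None, binary="convert"):
    """Return the ImageMagick argument list that scales src into dst.

    x_percent and y_percent are applied independently, so unequal values
    stretch the image along one axis. The 300% default is an arbitrary
    starting point, not a value taken from the post.
    """
    if y_percent is None:
        y_percent = x_percent
    if x_percent <= 0 or y_percent <= 0:
        raise ValueError("scale percentages must be positive")
    geometry = f"{x_percent}%x{y_percent}%"
    return [binary, str(src), "-resize", geometry, str(dst)]


def build_ocr_command(image, output_base, binary="tesseract"):
    """Return the Tesseract call; Tesseract writes its text to output_base + '.txt'."""
    return [binary, str(image), str(output_base)]


def upscale_and_ocr(src, scale_percent=300):
    """Upscale src, run Tesseract on the enlarged copy, and return the recognized text."""
    # ImageMagick 7 ships `magick`; older installs only have `convert`.
    # (On Windows, `convert` may resolve to an unrelated system tool.)
    magick = shutil.which("magick") or shutil.which("convert")
    tesseract = shutil.which("tesseract")
    if magick is None or tesseract is None:
        raise RuntimeError("ImageMagick and Tesseract must both be on PATH")

    src = Path(src)
    enlarged = src.with_name(src.stem + "_big.png")   # hypothetical naming scheme
    output_base = src.with_name(src.stem + "_ocr")    # hypothetical naming scheme

    subprocess.run(build_resize_command(src, enlarged, scale_percent, binary=magick), check=True)
    subprocess.run(build_ocr_command(enlarged, output_base, binary=tesseract), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Something like `upscale_and_ocr("scan.png")` would then return the text for one file. It is worth trying a few scale factors on your own images, since the best value probably depends on the original font size.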

So the Tesseract engine is without doubt the best open source OCR engine on the market. No prior image cleaning was required here. The caveat is that it does not work on files with a lot of embedded images, and I couldn't figure out a way to train Tesseract to ignore them. Also, the text layout and formatting in the image make a big difference. It works great on images with just text. Hope this helped.

zenCoder asked Oct 28 '13 16:10


People also ask

How do I know if my OCR is accurate?

Measuring OCR accuracy is done by taking the output of an OCR run for an image and comparing it to the original version of the same text. You can then either count how many characters were detected correctly (character level accuracy), or count how many words were recognized correctly (word level accuracy).
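
As a minimal sketch of that comparison: the description above only says to count correct characters or words, so using edit distance (Levenshtein) to do the counting is an assumption here, though a fairly common one. The function names are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # item missing from the OCR output
                current[j - 1] + 1,      # extra item in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def _accuracy(ref, hyp):
    """1 - (errors / reference length), clamped at 0 for very noisy output."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def character_accuracy(reference_text, ocr_text):
    """Character-level accuracy of ocr_text against the known-correct text."""
    return _accuracy(reference_text, ocr_text)


def word_accuracy(reference_text, ocr_text):
    """Word-level accuracy, splitting both texts on whitespace."""
    return _accuracy(reference_text.split(), ocr_text.split())
```

For a reference of `"hello world"` and an OCR result of `"hel1o world"`, one character out of eleven is wrong, so the character accuracy is 10/11, while the word accuracy is 0.5 because one of the two words does not match.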


1 Answer

Not sure if your intent is commercial use or not, but this works wonders if you're performing OCR on a bunch of similar images.

http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

[Images: the original, and the result after pre-processing with the given arguments]

Milne answered Sep 21 '22 12:09