
Image processing for OCR with leptonica (inverse color text)

I am trying to process the following image with leptonica to extract text with tesseract.

Original Image: original image

Tesseract on the original image yields this:

i s l
D2J1FiiE-l191x1iitmwii9 uhiaiislz-2 Q ~37
Bottom linez
With a little time!
you can learn social media technology
using free online resources-
And if you donity
youlll be at a significant disadvantage
to
other HOn-pFOiiTS-

Not great, especially for the text over the background at the top. So, using leptonica, I apply a background-removal algorithm (blur, difference, threshold, invert) to get the following image: processed image

But tesseract doesn't do a good job with it:

@@r-mair lkrm@W lh@w ilr@ mJs@ iklh@ ii@c2lhm1@ll
mm Mime
VWU1 a Mitt-Jle time-
@1m ll@@Wn Om @@@lh1
using free onhne resources-
Andifyoudoni
9110 ate a $0 D
to other non-profrts
I

The main problem, it seems, is that all of the text is now outlined instead of solid. How can I adjust my algorithm, or what can I add, to make the text solid?
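For reference, the blur, difference, threshold, invert steps can be sketched in plain Python on a tiny grayscale grid. This is not the leptonica code that produced the processed image; `box_blur`, `remove_background`, the blur radius, and the threshold value are all illustrative assumptions. In this toy version, at least, a bright block wider than the blur window comes out hollow: its interior equals the local mean, so only the edges survive the threshold, which resembles the outlined text.

```python
def box_blur(img, radius):
    """Mean filter over a (2*radius+1) square window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = count = 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total // count
    return out


def remove_background(img, radius=1, thresh=20):
    """Blur, take the absolute difference, threshold, and invert.

    Pixels that differ from their local mean by more than `thresh`
    become black (0); everything else becomes white (255).
    """
    blurred = box_blur(img, radius)
    return [
        [0 if abs(p - b) > thresh else 255 for p, b in zip(row, brow)]
        for row, brow in zip(img, blurred)
    ]


# Toy input (assumed values): a 5x5 bright "stroke" on a dark 9x9 background.
img = [[40] * 9 for _ in range(9)]
for y in range(2, 7):
    for x in range(2, 7):
        img[y][x] = 200

result = remove_background(img, radius=1, thresh=20)
for row in result:
    print("".join("#" if p == 0 else "." for p in row))
```

If the real pipeline behaves the same way, the blur radius relative to the stroke width may be the thing to look at, though that is only a guess from this toy model.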

jasonlfunk asked Jul 26 '12 21:07




1 Answer

It seems that this paper proposes a binarization method which solves your problem:

T Kasar, J Kumar and A G Ramakrishnan. Font and Background Color Independent Text Binarization. (2007)

Kasar et al. method performance
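To give a feel for what "font and background color independent" means in practice, here is a minimal sketch. It is not the paper's algorithm, which, as I understand it, works on edge-based connected components with a threshold per component. This simplified version only shows the polarity idea: guess the background of a region from its border pixels, then threshold so the text always comes out black on white. The function name `binarize_region`, the border-median background estimate, and the midpoint threshold are all assumptions made for illustration.

```python
from statistics import median


def binarize_region(region):
    """Return a black-text-on-white (0 on 255) version of a grayscale region."""
    h, w = len(region), len(region[0])

    # Assumption: the border of the region is mostly background.
    border = [
        region[y][x]
        for y in range(h)
        for x in range(w)
        if y in (0, h - 1) or x in (0, w - 1)
    ]
    background = median(border)

    # Assumption: a midpoint threshold separates text from background.
    pixels = [p for row in region for p in row]
    thresh = (min(pixels) + max(pixels)) / 2

    if background < thresh:
        # Light text on a dark background: bright pixels become black.
        return [[0 if p >= thresh else 255 for p in row] for row in region]
    # Dark text on a light background: dark pixels become black.
    return [[0 if p < thresh else 255 for p in row] for row in region]


def make_region(bg, fg):
    """Hypothetical 5x5 test region: a 3x3 block of `fg` surrounded by `bg`."""
    region = [[bg] * 5 for _ in range(5)]
    for y in range(1, 4):
        for x in range(1, 4):
            region[y][x] = fg
    return region


light_on_dark = binarize_region(make_region(bg=30, fg=220))
dark_on_light = binarize_region(make_region(bg=230, fg=20))
print(light_on_dark == dark_on_light)
```

Because this thresholds intensities directly rather than a difference from a blurred copy, the interior of a stroke stays filled in both polarities.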

sastanin answered Nov 14 '22 07:11