 

OCR: segmentation of small text

The problem

I've been building a (very) simple OCR engine. Since I'm trying to classify very small (pixel-scale) characters, I'm having some difficulty with segmentation. Here's an example, after best-effort image-wide thresholding:

[image: example of problematic segmentation on "63"]

What I've tried

Error detection:

  • flag segments with a large horizontal size (a sketch follows this list). This mostly works, but gives false positives for a few naturally wide characters.
  • classify every segment, and reject on low score. This seems a bit wasteful.
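
For reference, a minimal sketch of the width check, assuming a binary NumPy image (foreground = True) and an invented MAX_CHAR_WIDTH limit that you'd tune to the font:

```python
import numpy as np
from scipy import ndimage

MAX_CHAR_WIDTH = 8  # pixels; a placeholder value, tune to your character size

def find_suspect_segments(binary):
    """Return bounding slices of connected components wider than the limit."""
    labels, n = ndimage.label(binary)
    suspects = []
    for sl in ndimage.find_objects(labels):
        width = sl[1].stop - sl[1].start  # horizontal extent of the component
        if width > MAX_CHAR_WIDTH:
            suspects.append(sl)  # likely two or more merged characters
    return suspects
```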

Error correction:

  • sum pixels vertically (a vertical projection histogram) and cut at the minimum (see the sketch after this list). In many of the samples, this cuts segments in the wrong place.
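
The projection cut itself is only a few lines; here is a sketch, with the same assumed binary image as above:

```python
import numpy as np

def split_at_projection_minimum(segment):
    """Split a merged segment at the column with the fewest foreground pixels."""
    histogram = segment.sum(axis=0)     # foreground pixels per column
    interior = histogram[1:-1]          # never cut at the outer edges
    cut = 1 + int(np.argmin(interior))  # column index of the global minimum
    return segment[:, :cut], segment[:, cut:]
```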

What I haven't tried yet

  • Classifying on all possible segmentation points (pixels). This would be very wasteful, and difficult to extend to a segment of three merged characters (a sketch of the two-character case follows this list).
  • I've been reading up on morphological approaches that turn the characters into mathematical curves, but I don't really know where to start, or whether it's worth the effort.
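
For what it's worth, the two-character brute force is short. This sketch assumes a hypothetical classify(image) -> (label, score) interface, which is not something from the post; extending to three merged characters would mean recursing on the halves:

```python
def best_split(segment, classify, min_width=2):
    """Try every cut column, score both halves, and keep the best pair."""
    best = None
    for cut in range(min_width, segment.shape[1] - min_width + 1):
        left, right = segment[:, :cut], segment[:, cut:]
        (l_char, l_score) = classify(left)
        (r_char, r_score) = classify(right)
        score = l_score + r_score
        if best is None or score > best[0]:
            best = (score, cut, l_char + r_char)
    return best  # (combined score, cut column, recognized pair)
```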

Where to go from here?

I have no idea. Hence this question :)

asked Dec 22 '12 by loopbackbee



2 Answers

Lean back and half close your eyes.

63 :-)

Now, if only it was so easy for a computer!

It's tantalisingly close to what double-patterning does (or un-does?) in silicon masks.

I would suggest oversampling (doubling or quadrupling the pixel count on each axis), filtering (probably low-pass, or possibly band-pass with the passband at the spatial frequency of a line), and re-thresholding until the characters separate. It's expensive, so only apply it in problem areas.
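
A minimal sketch of that pipeline using OpenCV; the 4x factor and 5x5 kernel are placeholder values to tune, not numbers from the answer, and the input is assumed to be an 8-bit grayscale patch:

```python
import cv2

def reseparate(gray_patch, factor=4):
    """Upsample a problem area, low-pass filter it, and re-threshold."""
    big = cv2.resize(gray_patch, None, fx=factor, fy=factor,
                     interpolation=cv2.INTER_CUBIC)   # oversample
    smooth = cv2.GaussianBlur(big, (5, 5), 0)         # low-pass filter
    _, binary = cv2.threshold(smooth, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```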

answered Oct 19 '22 by user_1818839


Reinvent your problem so you do not need segmentation.

Really, at this scale I think you'd be better off investing in other approaches. For example, if you're OCRing text (are you?), you can use line information (character height). There aren't many fonts that remain readable at such small sizes. My approach would be an algorithm that scans the image in scanlines (from left to right, taking pixels from top to bottom) and tries to find correlations between trained text and the most recent scanlines (n, n-1, ..., n-x).

And you probably need the information in the grayscale levels as well, so it's better not to threshold the images.
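
A sketch of the column-correlation idea on grayscale data; the templates dict (character -> trained column profile of the same height) is an assumed structure, not something from the answer:

```python
import numpy as np

def match_column(column, templates):
    """Return the template character whose profile best correlates with one scanline."""
    best_char, best_corr = None, -1.0
    for char, profile in templates.items():
        # normalized cross-correlation of the two grayscale column profiles
        a = (column - column.mean()) / (column.std() + 1e-9)
        b = (profile - profile.mean()) / (profile.std() + 1e-9)
        corr = float(np.dot(a, b)) / len(a)
        if corr > best_corr:
            best_char, best_corr = char, corr
    return best_char, best_corr
```

In practice you'd accumulate these scores over the last few scanlines (n, n-1, ..., n-x) rather than deciding from a single column.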

answered Oct 19 '22 by Rob Audenaerde