How to correct OCR segmentation errors using bounding rectangles?

I'm using Tesseract for OCR and have noticed that segmentation errors sometimes occur: characters that "obviously" belong together are split into separate strings.

Given a list of characters and their bounding boxes found in one text line, and the preliminary OCR result suggesting which of these characters belong to one word, which algorithms can I apply to correct segmentation errors or to verify the result?

So this is the data available:

List<Word> words;                       // words of one text line, as produced by the OCR step
for (Word word : words) {
    for (Char c : word.getChars()) {
        char ch = c.getValue();         // the recognized character
        Rectangle rect = c.getRect();   // its bounding box in the image
    }
}
Pedro asked Apr 18 '12 14:04
1 Answer

For OCR post-correction that takes into account the characters and words, but admittedly not the bounding boxes, one common practice is

  • to use a dictionary of valid words, as comprehensive as possible
  • to check the words that result from the OCR algorithm against that dictionary
  • if a word cannot be found as an exact match in the dictionary, to try and find a similar one

To make this possible, you need a dictionary implementation that supports searching for similar strings, also known as approximate string matching or fuzzy string matching.
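As a minimal sketch of this exact-match-then-fuzzy lookup: the class and method names, the dictionary type, and the edit-distance threshold below are my own illustrative choices, not part of any particular library. (A naive linear scan over the dictionary is used here; the index structures discussed below make this step fast.)

```java
import java.util.Set;

public class DictionaryCorrector {

    // Plain Levenshtein distance between two strings (two-row DP).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Return the OCR word unchanged if it is in the dictionary; otherwise
    // the closest dictionary word within maxDist edits, or the original
    // word if nothing is close enough.
    static String correct(String ocrWord, Set<String> dictionary, int maxDist) {
        if (dictionary.contains(ocrWord)) return ocrWord;
        String best = ocrWord;
        int bestDist = maxDist + 1;
        for (String candidate : dictionary) {
            int d = levenshtein(ocrWord, candidate);
            if (d < bestDist) { bestDist = d; best = candidate; }
        }
        return best;
    }
}
```

For example, with a dictionary containing "hello", the OCR output "hel1o" (a typical l/1 confusion) would be corrected to "hello" at distance 1.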

The two main approaches for this that I am aware of are

  • Levenshtein automata, as described by Schulz et al. (DOI: 10.1007/s10032-002-0082-8)
  • Metric trees, such as the BK tree described by Baeza-Yates and Navarro (DOI: 10.1109/SPIRE.1998.712978)
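To illustrate the metric-tree idea, here is a minimal BK-tree sketch over plain Levenshtein distance; the names and structure are my own simplification, not taken from the cited papers. The triangle inequality lets the search prune any subtree whose edge label lies outside [d - maxDist, d + maxDist], so only a fraction of the dictionary is compared against the query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BKTree {
    private final String word;                         // word stored at this node
    private final Map<Integer, BKTree> children = new HashMap<>();

    public BKTree(String root) { this.word = root; }

    public void add(String w) {
        int d = levenshtein(w, word);
        if (d == 0) return;                            // already present
        BKTree child = children.get(d);
        if (child == null) children.put(d, new BKTree(w));
        else child.add(w);
    }

    // Collect all stored words within maxDist edits of the query.
    public List<String> search(String query, int maxDist) {
        List<String> result = new ArrayList<>();
        search(query, maxDist, result);
        return result;
    }

    private void search(String query, int maxDist, List<String> result) {
        int d = levenshtein(query, word);
        if (d <= maxDist) result.add(word);
        // Only children with edge labels in [d - maxDist, d + maxDist]
        // can contain matches (triangle inequality).
        for (int edge = d - maxDist; edge <= d + maxDist; edge++) {
            BKTree child = children.get(edge);
            if (child != null) child.search(query, maxDist, result);
        }
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```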

These approaches, as well as general approximate string matching approaches (such as search tries, q-gram matching and n-gram matching), all inherently use some kind of edit distance measure, more or less similar to Levenshtein distance. After analysing the specific OCR errors you are dealing with, you might want to adjust the edit distance algorithm and the other resources you are using to your specific needs. This may involve things like:

  • Assume a lower substitution distance between characters that your OCR program confuses frequently, or characters that look particularly similar when rendered with the font or style you are dealing with
  • Take possible segmentation errors into account by putting frequently occurring pairs of words in the dictionary (in addition to single words)
  • Ensure that your dictionary contains as many named entities and other domain-specific (or corpus-specific) elements as possible
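The first adjustment above can be sketched as an edit distance with per-pair substitution costs. The confusion pairs and their costs here ('l' vs '1', 'O' vs '0' at cost 0.2) are illustrative assumptions; in practice you would derive them from an error analysis of your own OCR output.

```java
import java.util.Map;

public class WeightedLevenshtein {
    // Cost table for frequent OCR confusions; any other substitution costs 1.0.
    static final Map<String, Double> CONFUSION = Map.of(
        "l1", 0.2, "1l", 0.2,
        "O0", 0.2, "0O", 0.2);

    static double subCost(char a, char b) {
        if (a == b) return 0.0;
        Double c = CONFUSION.get("" + a + b);
        return c != null ? c : 1.0;
    }

    // Standard DP for edit distance, with the weighted substitution cost
    // plugged into the diagonal move.
    static double distance(String a, String b) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + subCost(a.charAt(i - 1), b.charAt(j - 1)));
        return d[a.length()][b.length()];
    }
}
```

With these weights, "wa11" is much closer to "wall" (distance 0.4) than a word differing by two arbitrary characters would be, so likely OCR confusions rank ahead of unrelated dictionary words.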

Furthermore, you can try to use a grammar and/or a statistical language model, such as a Hidden Markov Model or a Conditional Random Field model -- similar to the models used by POS taggers -- to make word corrections in context.
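As a toy illustration of correction in context: among several candidate corrections for an OCR word, prefer the one that forms the more frequent bigram with the preceding word. The bigram counts below are made up, and a real system would use a smoothed statistical language model (or the HMM/CRF mentioned above) rather than raw counts.

```java
import java.util.List;
import java.util.Map;

public class ContextCorrector {
    // Made-up bigram counts standing in for a real language model.
    static final Map<String, Integer> BIGRAM_COUNTS = Map.of(
        "the car", 80, "the cat", 50,
        "red cat", 40, "red car", 2);

    // Pick the candidate with the highest count for "previous candidate".
    static String pickByContext(String previous, List<String> candidates) {
        String best = candidates.get(0);
        int bestCount = -1;
        for (String c : candidates) {
            int count = BIGRAM_COUNTS.getOrDefault(previous + " " + c, 0);
            if (count > bestCount) { bestCount = count; best = c; }
        }
        return best;
    }
}
```

With these counts, the ambiguous OCR word "ca?" would be resolved to "car" after "the" but to "cat" after "red".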

jogojapan answered Oct 29 '22 16:10