I'm using Tesseract for OCR and have noticed that segmentation errors sometimes occur: characters that "obviously" belong together are split into separate strings.
Given a list of the characters found in one text line, their bounding boxes, and the preliminary OCR result suggesting which of these characters belong to one word, which algorithms can I apply to correct segmentation errors or verify the result?
So this is the data available:
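List<Word> words; // words of one text line, as segmented by the OCR engine
for (Word word : words) {
    for (Char c : word.getChars()) {
        char ch = c.getValue();      // the recognized character
        Rectangle rect = c.getRect(); // its bounding box
    }
}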
For OCR post-correction that takes into account the characters and words, but admittedly not the bounding boxes, one common practice is
To make this possible, you need to prepare the dictionary implementation so it enables a search for similar strings, also known as approximate string matching or fuzzy string matching.
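As a rough sketch of what such a lookup could look like, the following uses plain Levenshtein distance and a linear scan over a small in-memory word list. The class name `FuzzyLookup`, the method names, and the sample dictionary are made up for illustration; a real dictionary would need an index rather than a scan over every entry.

```java
import java.util.Arrays;
import java.util.List;

class FuzzyLookup {

    // Classic Levenshtein distance, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest dictionary entry within maxDistance, or null if none qualifies.
    static String closest(String token, List<String> dictionary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String entry : dictionary) {
            int d = levenshtein(token, entry);
            if (d < bestDistance) {
                bestDistance = d;
                best = entry;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical sample dictionary.
        List<String> dictionary = Arrays.asList("segmentation", "character", "bounding");
        System.out.println(closest("segmentat1on", dictionary, 2));
    }
}
```

The same lookup can also be used to test whether two neighbouring fragments should be merged: if the concatenation of two adjacent OCR "words" is in (or very close to) the dictionary while the fragments themselves are not, that is a hint the split was a segmentation error.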
The two main approaches for this that I am aware of are
These approaches, as well as general approximate string matching approaches (such as search tries, q-gram matching and n-gram matching), all inherently use some kind of edit distance measure, more or less similar to Levenshtein distance. After analysing the specific OCR errors you are dealing with, you might want to adjust the edit distance algorithm and the other resources you are using to your specific needs. This may involve things like:
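To illustrate what adjusting the edit distance might mean in practice, here is a minimal sketch that makes substitutions between visually similar characters cheaper than other substitutions. The confusion pairs and the cost of 0.5 are assumptions chosen for the example, not measured values; you would derive them from the errors you actually observe in your own OCR output.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class OcrEditDistance {

    // Hypothetical confusion pairs; replace with pairs observed in your own data.
    private static final Set<String> CONFUSABLE = new HashSet<>(Arrays.asList(
            "1l", "l1", "1I", "I1", "lI", "Il", "0O", "O0", "5S", "S5"));

    // Substituting a confusable pair costs less than an arbitrary substitution.
    static double substitutionCost(char x, char y) {
        if (x == y) {
            return 0.0;
        }
        return CONFUSABLE.contains("" + x + y) ? 0.5 : 1.0;
    }

    // Levenshtein-style distance with weighted substitutions; insert/delete cost 1.
    static double distance(String a, String b) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double substitute = d[i - 1][j - 1]
                        + substitutionCost(a.charAt(i - 1), b.charAt(j - 1));
                double delete = d[i - 1][j] + 1.0;
                double insert = d[i][j - 1] + 1.0;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("he11o", "hello")); // two confusable substitutions
        System.out.println(distance("hexxo", "hello")); // two ordinary substitutions
    }
}
```

With these weights, a candidate that differs from the OCR output only by typical misreadings ranks ahead of a candidate that differs by the same number of unrelated characters.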
Furthermore, you can try to use a grammar and/or a statistical language model, such as a Hidden Markov Model or a Conditional Random Field model (similar to the models used by POS taggers), to make word corrections in context.