
What's the best way to detect garbled text in an OCR-ed document?

Are there any good NLP or statistical techniques for detecting garbled characters in OCR-ed text? Off the top of my head, I was thinking that looking at the distribution of character n-grams might be a good starting point, but I'm pretty new to the whole NLP domain.
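To make that concrete, here is roughly the kind of thing I was imagining: score each token by the average log probability of its character n-grams under a model trained on clean text, and treat unusually low scores as garbage. The n-gram order, smoothing, and toy corpus below are placeholders, not a tested recipe:

```python
import math
from collections import Counter

def char_ngrams(token, n=3):
    padded = f"^{token}$"                      # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(tokens, n=3):
    """Count character n-grams over tokens from clean, on-topic text."""
    counts = Counter()
    for tok in tokens:
        counts.update(char_ngrams(tok.lower(), n))
    return counts, sum(counts.values())

def avg_logprob(token, counts, total, n=3):
    """Average add-one-smoothed log probability per character n-gram."""
    grams = char_ngrams(token.lower(), n)
    if not grams:
        return float("-inf")
    vocab = len(counts) + 1
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)

# Toy demo: garbled tokens score far below clean ones.
clean = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"] * 100
counts, total = train(clean)
for tok in ["quick", "qu1ck", "xj#qz"]:
    print(tok, round(avg_logprob(tok, counts, total), 2))
```

The hope is that clean tokens cluster at a much higher average log probability than OCR noise, so a simple threshold separates them.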

Here is what I've looked at so far:

  • N-gram Statistics in English and Chinese: Similarities and Differences
  • Statistical Distributions of English Text

The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.
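On the Lucene side, the same scorer could drive a term-based pass over the index: terms with very low document frequency that also look implausible as character sequences seem like good garbage candidates. Continuing from the sketch above, with the (term, document frequency) pairs standing in for whatever you actually dump from the term dictionary, and both cutoffs made up:

```python
def suspect_terms(term_df_pairs, counts, total, max_df=1, min_logprob=-6.0):
    """Flag rare index terms that are also implausible character sequences.

    term_df_pairs: (term, document frequency) pairs dumped from the index;
    max_df and min_logprob are made-up cutoffs to tune on real data.
    """
    for term, df in term_df_pairs:
        if df <= max_df and avg_logprob(term, counts, total) < min_logprob:
            yield term

# Hypothetical dump from the term dictionary:
dump = [("jumps", 42), ("qu1ck", 1), ("xj#qz", 1), ("fox", 57)]
print(list(suspect_terms(dump, counts, total)))
```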


Any suggestions would be great! Thanks!

asked Oct 12 '25 by Luke Quinane

1 Answer

Yes, the most powerful tool in this case is n-grams. You should collect them from a text corpus related to your OCR documents (same topic, same domain). The problem is very similar to spell checking: if a small character change leads to a large jump in probability, the original token was most likely a mistake. A tutorial on n-gram-based spell checking will show you the mechanics; the same approach carries over directly.
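Here is a minimal sketch of that comparison, assuming you have some token-scoring function (the character n-gram scorer from the question would do) and a small table of OCR confusions; the confusion pairs, toy model, and threshold below are illustrative only:

```python
import math

# Illustrative OCR confusion pairs (bad -> likely intended); not exhaustive.
CONFUSIONS = [("1", "l"), ("0", "o"), ("rn", "m"), ("vv", "w")]

def variants(token):
    """Yield variants of token with one confusion substitution undone."""
    for i in range(len(token)):
        for bad, good in CONFUSIONS:
            if token.startswith(bad, i):
                yield token[:i] + good + token[i + len(bad):]

def looks_garbled(token, logprob, min_gain=math.log(5)):
    """Flag token if undoing one confusion makes it much more probable.

    logprob: any token-scoring function, e.g. a character n-gram model;
    min_gain is an untuned threshold.
    """
    base = logprob(token)
    return any(logprob(v) - base > min_gain for v in variants(token))

# Toy stand-in for a real language model: in-vocabulary words score high.
WORDS = {"mile", "wood", "hello"}
toy_logprob = lambda t: 0.0 if t in WORDS else -10.0
for tok in ["rnile", "vvood", "hello"]:
    print(tok, looks_garbled(tok, toy_logprob))
```

In practice you would derive the edit candidates from the confusion statistics of your OCR engine rather than a hand-made list.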

answered Oct 16 '25 by Robin Goupil

