
What's the best way to detect garbled text in an OCR-ed document?

Are there any good NLP or statistical techniques for detecting garbled characters in OCR-ed text? Off the top of my head, I was thinking that looking at the distribution of character n-grams might be a good starting point, but I'm pretty new to the whole NLP domain.
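To make that concrete, here is roughly the kind of thing I was imagining: score each token by the average log probability of its character n-grams under a model trained on clean text, and treat unusually low scores as garbage. The n-gram order, smoothing, and toy corpus below are placeholders, not a tested recipe:

```python
import math
from collections import Counter

def char_ngrams(token, n=3):
    padded = f"^{token}$"                      # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(tokens, n=3):
    """Count character n-grams over tokens from clean, on-topic text."""
    counts = Counter()
    for tok in tokens:
        counts.update(char_ngrams(tok.lower(), n))
    return counts, sum(counts.values())

def avg_logprob(token, counts, total, n=3):
    """Average add-one-smoothed log probability per character n-gram."""
    grams = char_ngrams(token.lower(), n)
    if not grams:
        return float("-inf")
    vocab = len(counts) + 1
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)

# Toy demo: garbled tokens score far below clean ones.
clean = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"] * 100
counts, total = train(clean)
for tok in ["quick", "qu1ck", "xj#qz"]:
    print(tok, round(avg_logprob(tok, counts, total), 2))
```

The hope is that clean tokens cluster at a much higher average log probability than OCR noise, so a simple threshold separates them.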

Here is what I've looked at so far:

  • N-gram Statistics in English and Chinese: Similarities and Differences
  • Statistical Distributions of English Text

The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.
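On the Lucene side, the same scorer could drive a term-based pass over the index: terms with very low document frequency that also look implausible as character sequences seem like good garbage candidates. Continuing from the sketch above, with the (term, document frequency) pairs standing in for whatever you actually dump from the term dictionary, and both cutoffs made up:

```python
def suspect_terms(term_df_pairs, counts, total, max_df=1, min_logprob=-6.0):
    """Flag rare index terms that are also implausible character sequences.

    term_df_pairs: (term, document frequency) pairs dumped from the index;
    max_df and min_logprob are made-up cutoffs to tune on real data.
    """
    for term, df in term_df_pairs:
        if df <= max_df and avg_logprob(term, counts, total) < min_logprob:
            yield term

# Hypothetical dump from the term dictionary:
dump = [("jumps", 42), ("qu1ck", 1), ("xj#qz", 1), ("fox", 57)]
print(list(suspect_terms(dump, counts, total)))
```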


Any suggestions would be great! Thanks!

asked Oct 12 '25 by Luke Quinane

1 Answer

Yes, the most powerful tool in this case is n-grams. You should collect them from a text corpus related to your OCR documents (same topic, same domain). The problem is very similar to spell checking: if a small character change leads to a large jump in probability, the original token was most likely a mistake. A tutorial on n-gram-based spell checking will show you the mechanics; the same approach carries over directly.
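Here is a minimal sketch of that comparison, assuming you have some token-scoring function (the character n-gram scorer from the question would do) and a small table of OCR confusions; the confusion pairs, toy model, and threshold below are illustrative only:

```python
import math

# Illustrative OCR confusion pairs (bad -> likely intended); not exhaustive.
CONFUSIONS = [("1", "l"), ("0", "o"), ("rn", "m"), ("vv", "w")]

def variants(token):
    """Yield variants of token with one confusion substitution undone."""
    for i in range(len(token)):
        for bad, good in CONFUSIONS:
            if token.startswith(bad, i):
                yield token[:i] + good + token[i + len(bad):]

def looks_garbled(token, logprob, min_gain=math.log(5)):
    """Flag token if undoing one confusion makes it much more probable.

    logprob: any token-scoring function, e.g. a character n-gram model;
    min_gain is an untuned threshold.
    """
    base = logprob(token)
    return any(logprob(v) - base > min_gain for v in variants(token))

# Toy stand-in for a real language model: in-vocabulary words score high.
WORDS = {"mile", "wood", "hello"}
toy_logprob = lambda t: 0.0 if t in WORDS else -10.0
for tok in ["rnile", "vvood", "hello"]:
    print(tok, looks_garbled(tok, toy_logprob))
```

In practice you would derive the edit candidates from the confusion statistics of your OCR engine rather than a hand-made list.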

answered Oct 16 '25 by Robin Goupil

