OCR error correction algorithms

I'm digitizing a large collection of scanned documents, using Tesseract 3 as my OCR engine. The quality of its output is mediocre: it often produces garbage characters before and after the actual text, as well as misspellings within the text.

For the former problem, it seems like there must be strategies for determining which text is actually text and which text isn't (much of this text is things like people's names, so I'm looking for solutions other than looking up words in a dictionary).

For the typo problem, most of the errors stem from a few misclassifications of letters (substituting l, 1, and I for one another, for instance), and it seems like there should be methods for guessing which words are misspelled (since not too many words in English have a "1" in the middle of them), and guessing what the appropriate correction is.

What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Googling has turned up lots of papers, but not much that is concrete. If there aren't implementations available, which of the many papers would be a good starting place?

Andrew Pendleton asked Apr 13 '11 22:04



1 Answer

For "determining which text is actually text and which text isn't" you might want to look at rmgarbage from the ISRI (the Information Science Research Institute at UNLV, which tested Tesseract in its OCR accuracy evaluations). I've written a Perl implementation and there's also a Ruby implementation. For the 1 vs. l problem I'm experimenting with ocrspell (again from the same department), for which the original source is available.
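
To make the 1 vs. l idea concrete, here is a minimal sketch of confusion-set correction. It is not ocrspell's method, just an illustration under stated assumptions: `CONFUSION_CLASSES`, `candidates`, and `correct` are names I made up, and `lexicon` is assumed to be a set of known-good words (for names, it could be built from tokens that recur often in your own corpus rather than from a dictionary).

```python
from itertools import product

# Hypothetical confusion classes: characters the OCR engine tends to swap.
CONFUSION_CLASSES = ("l1I", "0O")

def candidates(token, classes=CONFUSION_CLASSES):
    """Return every spelling reachable by swapping characters within a class."""
    options = []
    for ch in token:
        group = next((g for g in classes if ch in g), None)
        # A confusable character may be any member of its class;
        # every other character stays fixed.
        options.append(sorted(group) if group else [ch])
    return ["".join(chars) for chars in product(*options)]

def correct(token, lexicon):
    """Replace token only when exactly one candidate is a known word."""
    if token in lexicon:
        return token
    matches = [c for c in candidates(token) if c in lexicon]
    # Leave ambiguous or unknown tokens untouched rather than guess.
    return matches[0] if len(matches) == 1 else token
```

For example, `correct("Pend1eton", {"Pendleton"})` gives `"Pendleton"`. The number of candidates grows exponentially with the number of confusable characters in a token, so a real implementation would want to cap or prune the search.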

I can only post two links, so the missing ones are:

  • ocrspell: enter "10.1007/PL00013558" at dx.doi.org
  • rmgarbage: search for "Automatic Removal of Garbage Strings in OCR Text: An Implementation"
  • Ruby implementation: search for "docsplit textcleaner"
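
To give a feel for the garbage-string approach, here is a rough sketch of a few token-level heuristics loosely in the spirit of rmgarbage. It is not the rules from the paper: the function names and every threshold below are my own assumptions and would need tuning against real OCR output.

```python
import re

VOWELS = set("aeiouyAEIOUY")

def looks_like_garbage(token, max_len=40, min_letters=5):
    """Heuristically flag a whitespace-delimited token as OCR noise."""
    # Assumed threshold: real words are rarely this long.
    if len(token) > max_len:
        return True
    # Fewer than half of the characters are letters or digits.
    alnum = sum(ch.isalnum() for ch in token)
    if alnum * 2 < len(token):
        return True
    # The same character four or more times in a row.
    if re.search(r"(.)\1{3,}", token):
        return True
    # More than two distinct punctuation characters inside the token.
    inner = token[1:-1]
    if len({ch for ch in inner if not ch.isalnum()}) > 2:
        return True
    # A longer run of letters with no vowel at all.
    letters = [ch for ch in token if ch.isalpha()]
    if len(letters) >= min_letters and not any(ch in VOWELS for ch in letters):
        return True
    return False

def clean_line(line):
    """Drop tokens flagged as garbage and rejoin the rest."""
    return " ".join(t for t in line.split() if not looks_like_garbage(t))
```

For instance, `clean_line("~~;:'_ Andrew Pendleton |||||||")` keeps only `"Andrew Pendleton"`. None of these checks needs a dictionary, which is why this style of filter copes with names, but it will also drop legitimate tokens such as standalone punctuation, vowel-free acronyms, and numbers like 1000000, so check it against a sample before running it over the whole collection.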
ZakW answered Oct 03 '22 17:10