I'm digitizing a large collection of scanned documents, using Tesseract 3 as my OCR engine. The quality of its output is mediocre: it often produces garbage characters before and after the actual text, as well as misspellings within the text.
For the former problem, it seems like there must be strategies for determining which text is actually text and which text isn't (much of this text is things like people's names, so I'm looking for solutions other than looking up words in a dictionary).
For the typo problem, most of the errors stem from a few misclassifications of letters (substituting "l", "1", and "I" for one another, for instance), and it seems like there should be methods for guessing which words are misspelled (since not too many words in English have a "1" in the middle of them), and guessing what the appropriate correction is.
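To make the idea concrete, here is a minimal Python sketch of that kind of correction, not a tested solution. The confusion table, the "digit between letters" test, and the names `CONFUSABLE`, `is_suspicious`, `candidates`, and `correct` are all assumptions made for illustration. Since much of the text is names, `known_words` is assumed to be whatever case-sensitive vocabulary is available (a name list, or frequent tokens from the corpus itself) rather than a general dictionary.

```python
import itertools
import re

# Assumed confusion sets: characters the OCR engine tends to swap.
CONFUSABLE = {"l": "l1I", "1": "l1I", "I": "l1I", "0": "0O", "O": "0O"}


def is_suspicious(token):
    """Flag tokens with digits sandwiched between letters, e.g. 'he1lo'."""
    return re.search(r"[A-Za-z]\d+[A-Za-z]", token) is not None


def candidates(token, max_candidates=256):
    """Yield spellings obtained by swapping confusable characters."""
    pools = [CONFUSABLE.get(ch, ch) for ch in token]
    # Cap the number of combinations so long tokens cannot blow up.
    for combo in itertools.islice(itertools.product(*pools), max_candidates):
        yield "".join(combo)


def correct(token, known_words):
    """Return the unique known candidate, or the token unchanged."""
    if not is_suspicious(token):
        return token
    matches = [c for c in candidates(token) if c in known_words]
    # Only rewrite when exactly one candidate is known; otherwise be conservative.
    return matches[0] if len(matches) == 1 else token


if __name__ == "__main__":
    vocabulary = {"hello", "William"}
    print(correct("he1lo", vocabulary))
    print(correct("Wi1liam", vocabulary))
```

The sketch deliberately leaves a token alone when zero or several candidates match, since a wrong "correction" of a name is usually worse than an obvious typo.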
What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Google has yielded lots of papers, but not much that is concrete. If there aren't implementations available, which of the many papers would be a good starting place?
OCR Error Types. OCR accuracy is negatively influenced by poor image quality (e.g., scanning resolution, noise) and any mismatch between the instances on which the character image classifier was trained and the rendering of the characters in the printed document (e.g., font, size, spacing).
Optical Character Recognition (OCR) Post Processing involves data cleaning steps for documents that were digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors generated due to the flaws in the OCR system.
For "determining which text is actually text and which text isn't" you might want to look at rmgarbage, from the same department that developed Tesseract (the ISRI). I've written a Perl implementation and there's also a Ruby implementation. For the 1 vs. l problem I'm experimenting with ocrspell (again from the same department), for which their original source is available.
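To give a feel for this style of garbage filtering, here is a rough Python sketch of shape-based heuristics that do not rely on a dictionary. It is written in the spirit of rmgarbage, not as a port of it: the specific rules, the thresholds (40 characters, four repeats, the 8:1 consonant ratio), and the names `looks_like_garbage` and `strip_garbage_edges` are assumptions chosen for illustration and would need tuning on real output.

```python
import re

# 'y' is counted as a vowel so names like "Smyth" are not penalised.
VOWELS = set("aeiouyAEIOUY")


def looks_like_garbage(token, max_len=40):
    """Guess whether a whitespace-delimited token is OCR noise."""
    # Very long tokens rarely occur in running text.
    if len(token) > max_len:
        return True
    alnum = sum(ch.isalnum() for ch in token)
    # More symbols than letters and digits.
    if len(token) - alnum > alnum:
        return True
    # Four or more identical characters in a row.
    if re.search(r"(.)\1{3,}", token):
        return True
    letters = [ch for ch in token if ch.isalpha()]
    if len(letters) >= 6:
        vowels = sum(ch in VOWELS for ch in letters)
        consonants = len(letters) - vowels
        # Longer words with almost no vowels are unlikely to be real.
        if consonants > 8 * vowels:
            return True
    return False


def strip_garbage_edges(line):
    """Drop garbage tokens from the start and end of a line only."""
    tokens = line.split()
    start, end = 0, len(tokens)
    while start < end and looks_like_garbage(tokens[start]):
        start += 1
    while end > start and looks_like_garbage(tokens[end - 1]):
        end -= 1
    return " ".join(tokens[start:end])


if __name__ == "__main__":
    print(strip_garbage_edges("~!;, John Smith |||||"))
```

Trimming only the edges matches the "garbage before and after the actual text" pattern and avoids deleting an unusual name in the middle of a line.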
I can only post two links, so the missing ones are: