OCR error correction algorithms

I'm digitizing a large collection of scanned documents, using Tesseract 3 as my OCR engine. The quality of its output is mediocre: it often produces garbage characters before and after the actual text, as well as misspellings within the text.

For the former problem, it seems like there must be strategies for determining which text is actually text and which text isn't (much of this text is things like people's names, so I'm looking for solutions other than looking up words in a dictionary).

For the typo problem, most of the errors stem from a few misclassifications of letters (substituting l, 1, and I for one another, for instance), and it seems like there should be methods for guessing which words are misspelled (since not too many words in English have a "1" in the middle of them), and guessing what the appropriate correction is.
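To make the idea concrete, here is a minimal sketch of what such a corrector might look like. Everything in it is an assumption for illustration: the `CONFUSABLE` table, the "mixes letters and digits" test, and the `known_words` set (which could just as well be a list of names as a dictionary) are hypothetical, not taken from any existing tool.

```python
from itertools import product

# Hypothetical confusion set: characters the OCR engine tends to swap.
CONFUSABLE = {"l": "l1I", "1": "l1I", "I": "l1I"}

def candidates(token):
    """Yield every variant of token obtained by swapping confusable characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    for combo in product(*options):
        yield "".join(combo)

def looks_suspicious(token):
    """Flag tokens that mix letters and digits, e.g. a '1' inside a word."""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return has_digit and has_alpha

def correct(token, known_words):
    """Return the first variant found in known_words, else the token unchanged."""
    if not looks_suspicious(token):
        return token
    for cand in candidates(token):
        if cand in known_words:
            return cand
    return token

if __name__ == "__main__":
    names = {"Alice", "William"}
    print(correct("A1ice", names))
    print(correct("Wi11iam", names))
```

The number of variants grows as 3^n in the number of confusable characters, so a real version would need to cap or rank candidates rather than enumerate them all.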

What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Google has yielded lots of papers, but not much that is concrete. If there aren't implementations available, which of the many papers would be a good starting place?


asked Apr 13 '11 22:04

Andrew Pendleton

1 Answer

For "determining which text is actually text and which text isn't" you might want to look at rmgarbage, from the same department that developed Tesseract (the ISRI). I've written a Perl implementation, and there's also a Ruby implementation. For the 1 vs. l problem I'm experimenting with ocrspell (again from the same department), for which the original source is available.

I can only post two links, so the missing ones are:

ocrspell: enter "10.1007/PL00013558" at dx.doi.org
rmgarbage: search for "Automatic Removal of Garbage Strings in OCR Text: An Implementation"
ruby implementation: search for "docsplit textcleaner"
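
To give a rough feel for the kind of rules a garbage-string filter applies, here is a simplified sketch. It is not the Perl or Ruby implementation mentioned above, and the specific rules and thresholds (`max_len=40`, the 50% alphanumeric ratio, four repeated characters, two distinct inner punctuation marks) are assumptions chosen for illustration; check the paper for the actual rule set.

```python
import re

def is_garbage(token, max_len=40):
    """Heuristic garbage test; rules and thresholds are illustrative assumptions."""
    # Very long strings are unlikely to be real words.
    if len(token) > max_len:
        return True
    # Mostly non-alphanumeric strings are likely noise.
    alnum = sum(c.isalnum() for c in token)
    if token and alnum / len(token) < 0.5:
        return True
    # Four or more identical characters in a row.
    if re.search(r"(.)\1{3,}", token):
        return True
    # Two or more distinct punctuation characters inside the token
    # (first and last characters are ignored).
    inner = token[1:-1]
    distinct_punct = {c for c in inner if not c.isalnum() and not c.isspace()}
    if len(distinct_punct) >= 2:
        return True
    return False

def strip_garbage(line):
    """Drop whitespace-separated tokens that look like garbage."""
    return " ".join(t for t in line.split() if not is_garbage(t))

if __name__ == "__main__":
    print(strip_garbage(";:_~ Andrew Pendleton ||||"))
```

None of these checks needs a dictionary, which is why this style of filter suits text full of names. A hyphenated name that also contains an apostrophe (for example "O'Brien-Smith") would trip the punctuation rule as written, so the thresholds would need tuning for name-heavy collections.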

answered Oct 03 '22 17:10

ZakW

