
Introduction to OCR

Tags:

ocr

Someone gave me a trove of amazing information: 200MB .tiff images of scanned announcements that go back to the 1940s. I want to digitize it, but I have no knowledge whatsoever of OCR. Some of the early material is barely readable by a human, let alone a machine. It is also in Hebrew.

I'm looking for advice on how to approach this: good suggestions for books, articles, code libraries or software (all of which should be freely available on the web). I'm proficient in C++ and Python and can pick up another language if needed.

Thank you.

CamelCamelCamel asked Apr 30 '11 22:04


1 Answer

This sounds like a good fit for Python with an OCR library. A quick Google search turned up PyTesser:

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.

...

Usage Example

>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
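
Since the description above says PyTesser works by calling the Tesseract executable as an external program, the same idea can be sketched directly for a whole folder of scans. This is not PyTesser's API, just a rough illustration: the function names (build_tesseract_command, ocr_file, find_scans, ocr_folder) are made up for this example, and it assumes that a tesseract executable is on the PATH and that its Hebrew trained data ("heb") is installed.

```python
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="heb"):
    """Return the argv list for one Tesseract run (hypothetical helper).

    Tesseract's CLI takes an input image, an output base name (it appends
    ".txt" itself), and an optional "-l" language code.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_file(image_path, lang="heb"):
    """OCR a single image by shelling out to the tesseract executable."""
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "page"
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang),
            check=True,           # raise if tesseract exits with an error
            capture_output=True,  # keep tesseract's status output off the console
        )
        # Assumption: "heb" trained data is installed, so the text is UTF-8 Hebrew.
        return output_base.with_suffix(".txt").read_text(encoding="utf-8")


def find_scans(folder):
    """List the .tif/.tiff files in a folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in {".tif", ".tiff"}
    )


def ocr_folder(folder, lang="heb"):
    """Write a .txt file next to every scan in the folder."""
    for scan in find_scans(folder):
        text = ocr_file(scan, lang)
        scan.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Calling ocr_folder("scans") on a hypothetical "scans" directory would leave one .txt file beside each image. For the oldest, barely readable pages the output will probably need manual proofreading, so it is worth trying this on a handful of files first.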
Matt Ball answered Sep 20 '22 12:09