Introduction to OCR

Question

Someone gave me a trove of amazing information: 200MB of .tiff images of scanned announcements going back to the 1940s. I want to digitize it, but I have no knowledge whatsoever of OCR. Some of the early material is barely readable by a human, let alone a machine. It is also in Hebrew.

I'm looking for advice on how to approach this: good suggestions for books, articles, code libraries, or software, all freely available on the web. I'm proficient in C++ and Python and can pick up another language if needed.

Thank you.

Matt Ball · Accepted Answer

This sounds like a great task for Python, using an OCR library. A quick Google search turned up pytesser:

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.

...

Usage Example
>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
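
The quoted description says PyTesser works by calling the Tesseract executable as an external program. For the Hebrew scans in the question, the same idea can be sketched directly with the standard library. This is only a rough sketch under several assumptions: Python 3.7+, a `tesseract` binary on the PATH that accepts `stdout` as its output base, and an installed Hebrew language pack (`heb`). The function names `build_tesseract_command`, `ocr_image`, and `ocr_folder` are hypothetical helpers, not part of PyTesser or Tesseract.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="heb"):
    """Build the argument list for one Tesseract command-line call.

    Assumption: passing "stdout" as the output base makes Tesseract print
    the recognized text instead of writing a .txt file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="heb"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout.decode("utf-8")


def ocr_folder(folder, lang="heb"):
    """Yield (path, text) for every .tif/.tiff file in a folder, sorted by name."""
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in {".tif", ".tiff"}:
            yield path, ocr_image(path, lang)
```

With a hypothetical folder named `scans`, `for path, text in ocr_folder("scans"): ...` would walk the whole collection one image at a time. Recognition quality on the oldest, barely legible pages is not something this sketch can promise.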

Tags:

ocr
