I want to find a easy-to-use OCR python module in linux, I have found pytesser http://code.google.com/p/pytesser/, but it contains a .exe executable file.
I tried changed the code to use wine, and it really works, but it's too slow and really not a good idea.
Is there any Linux alternatives that as easy-to-use as it?
You can install the python wrapper for tesseract after this using pip. Tesseract library is shipped with a handy command-line tool called tesseract. We can use this tool to perform OCR on images and the output is stored in a text file.
Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr' - search your distribution's repositories to find it. Packages for over 130 languages and over 35 scripts are also available directly from the Linux distributions.
Learn how to import the pytesseract package into your Python scripts. Use OpenCV to load an input image from disk. Pass the image into the Tesseract OCR engine via the pytesseract library. Display the OCR'd text results on our terminal.
Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR Engine.
You can just wrap tesseract
in a function:
import os import tempfile import subprocess def ocr(path): temp = tempfile.NamedTemporaryFile(delete=False) process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) process.communicate() with open(temp.name + '.txt', 'r') as handle: contents = handle.read() os.remove(temp.name + '.txt') os.remove(temp.name) return contents
If you want document segmentation and more advanced features, try out OCRopus.
In addition to Blender's answer, that just executs Tesseract executable, I would like to add that there exist other alternatives for OCR that can also be called as external process.
ABBYY comand line OCR utility: http://ocr4linux.com/en:start
It is not free, so worth to consider only if Tesseract accuracy is not good enough for your task, or you need more sophisticated layout analisys or you need to export PDF, Word and other files.
Update: here's comparison of ABBYY and tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
Disclaimer: I work for ABBYY
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With