 

Python OCR Module in Linux?


I want to find an easy-to-use OCR Python module for Linux. I found pytesser (http://code.google.com/p/pytesser/), but it ships with a .exe executable.

I tried changing the code to run it through Wine, and it does work, but it's too slow and not really a good idea.

Are there any Linux alternatives that are as easy to use?

Felix Yan asked Apr 27 '11 05:04





2 Answers

You can just wrap tesseract in a function:

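    import os
    import subprocess
    import tempfile

    def ocr(path):
        # Reserve a unique temporary name to use as tesseract's output base.
        temp = tempfile.NamedTemporaryFile(delete=False)

        # tesseract writes the recognized text to <output base>.txt.
        process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        process.communicate()

        with open(temp.name + '.txt', 'r') as handle:
            contents = handle.read()

        # Remove both the output file and the placeholder temp file.
        os.remove(temp.name + '.txt')
        os.remove(temp.name)

        return contents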
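A variant of the wrapper above, as a rough sketch rather than a definitive implementation: it works in a temporary directory so cleanup is automatic, and it raises an error when the command exits with a non-zero status instead of failing later on a missing output file. The `tesseract_cmd` parameter and the `result` output base name are assumptions added for illustration; the sketch also assumes the output is UTF-8 text.

```python
import subprocess
import tempfile
from pathlib import Path


def ocr(image_path, tesseract_cmd="tesseract"):
    """Run a tesseract-style command on image_path and return the text.

    tesseract_cmd is a hypothetical parameter (not in the original answer)
    that lets the caller point at a different executable.
    """
    # The directory and everything in it is deleted when the block exits.
    with tempfile.TemporaryDirectory() as workdir:
        # "result" is an arbitrary output base; tesseract appends ".txt".
        output_base = Path(workdir) / "result"
        completed = subprocess.run(
            [tesseract_cmd, str(image_path), str(output_base)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        if completed.returncode != 0:
            # Surface whatever the command printed as the error message.
            raise RuntimeError(completed.stdout.decode(errors="replace"))
        return output_base.with_suffix(".txt").read_text(encoding="utf-8")
```

Calling `ocr("scan.png")` would then return the recognized text as a string, provided the `tesseract` executable is on the `PATH`; `scan.png` is only a placeholder file name.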

If you want document segmentation and more advanced features, try out OCRopus.

Blender answered Oct 24 '22 05:10



In addition to Blender's answer, which just runs the Tesseract executable, I would like to add that there are other OCR alternatives that can also be called as an external process.

ABBYY command-line OCR utility: http://ocr4linux.com/en:start

It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, if you need more sophisticated layout analysis, or if you need to export to PDF, Word, and other formats.

Update: here's a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

Disclaimer: I work for ABBYY

Tomato answered Oct 24 '22 05:10
