
Extracting lines from an image to feed to OCR - Tesseract

I was watching this talk from PyCon: http://youtu.be/B1d9dpqBDVA?t=15m34s. Around the 15:33 mark, the speaker talks about extracting lines from an image (a receipt) and then feeding those to the OCR engine so that the text can be extracted more accurately.

I have a similar need where I'm passing images to the OCR engine. However, I don't quite understand what he means by extracting lines from an image. What are some open source tools that I can use to extract lines from an image?

birdy asked Mar 28 '13 15:03




1 Answer

Take a look at the technique used to detect the skew angle of text.

Groups of lines are used to isolate text in an image (this is the interesting part).

From this result you can easily detect the upper and lower limits of each line of text. The text itself will be located between them. I've faced a similar problem before, and the code might be useful to you:

All you need to do from here is crop each pair of lines and feed that as an image to Tesseract.
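
To make the "upper/lower limits, then crop" idea concrete, here is a minimal sketch. It is not the code referred to above; it swaps in a simpler technique, a horizontal projection profile, and assumes the image is already deskewed and binarized into rows of 0/1 pixels (1 = ink). The names `find_text_lines`, `crop_lines`, `min_ink` and `min_height` are made up for this illustration.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of pixels; 1 = ink (text), 0 = background


def find_text_lines(bitmap: Bitmap, min_ink: int = 1,
                    min_height: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    Assumption: the text is horizontal, so blank rows separate the lines.
    """
    # Horizontal projection: count the ink pixels in every row.
    profile = [sum(row) for row in bitmap]

    bands = []
    top = None  # first row of the band currently being scanned, if any
    for y, ink in enumerate(profile):
        if ink >= min_ink and top is None:
            top = y  # upper limit of a new text line
        elif ink < min_ink and top is not None:
            if y - top >= min_height:  # drop bands that are too thin (noise)
                bands.append((top, y))  # lower limit reached
            top = None
    # A band that runs to the last row is closed at the bottom edge.
    if top is not None and len(profile) - top >= min_height:
        bands.append((top, len(profile)))
    return bands


def crop_lines(bitmap: Bitmap, bands: List[Tuple[int, int]]) -> List[Bitmap]:
    """Cut the bitmap into one strip per detected text line."""
    return [bitmap[top:bottom] for top, bottom in bands]


# Tiny hand-made "page": two text lines separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
bands = find_text_lines(page)
print(bands)  # [(1, 3), (5, 6)]
print(len(crop_lines(page, bands)))  # 2
```

With a real image you would probably do the same thing on a thresholded array and crop the original with something like Pillow's `Image.crop((left, top, right, bottom))`, then pass each strip to Tesseract, for example through `pytesseract.image_to_string(strip)`. Raising `min_height` is a cheap way to ignore specks and thin ruled lines, though the right value depends on your scans.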

karlphillip answered Sep 18 '22 09:09
