I was watching this talk from pycon http://youtu.be/B1d9dpqBDVA?t=15m34s around the 15:33 mark the speaker talks about extracting lines from an image (receipt) and then feeding that to the OCR engine so that text can be extracted in a better way.
I have a similar need where I'm passing images to the OCR engine. However, I don't quite understand what he means by extracting lines from an image. What are some open source tools that I can use to extract lines from an image?
Create a Python tesseract script Create a project folder and add a new main.py file inside that folder. Once the application gives access to PDF files, its content will be extracted in the form of images. These images will then be processed to extract the text.
While Tesseract is known as one of the most accurate free OCR engines available today, it has numerous limitations that dramatically affect its performance; its ability to correctly recognize characters in a scan or image.
OCR is the “Optical Character Recognition” technology used to convert any image containing handwritten or printed readable text. Once the file has been processed through the online OCR, the extracted text can be further edited by using word processing software like MS Word.
Take a look at the technique used to detect the skew angle of a text.
Groups are lines are used to isolate text on an image (this is the interesting part).
From this result you can easily detect the upper/lower limits of each line of text. The text itself will be located inside them. I've faced a similar problem before, the code might be useful to you:
All you need to do from here is crop each pair of lines and feed that as an image to Tesseract.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With