Text layout recognition with Python

Tags: python, image-processing, ocr, document-layout-analysis

I'm trying to go through several thousand scanned files and sort them into folders by type (i.e., if a file is a scanned copy of formA, it should go in the formA folder; if it's a scanned copy of formB, it should go in the formB folder, and so on). I feel like the best way to match files to types is by their text outlines, but I'm totally new to image processing, so if there's a better solution, I'm all ears.

I'm working in Python. Any ideas on the best way to do this? PIL? OpenCV? ImageMagick?

Thanks in advance...

asked Jul 11 '11 20:07

danwoods

2 Answers

This library is probably of interest to you:
http://code.google.com/p/ocropus/
It's made by Googlers and lets you do OCR and layout analysis from Python.
I had some trouble installing it, but that was quite a while back, so things may have been fixed by now.
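
Once an OCR tool has given you the text of each scan, the sorting step itself could look roughly like the sketch below. It does not call ocropus at all; it assumes you already have the recognized text as a string. The names `FORM_KEYWORDS`, `classify_text` and `sort_file`, and the marker phrases, are made up for illustration.

```python
import shutil
from pathlib import Path

# Hypothetical marker phrases per form type; replace them with text
# that actually appears on your own forms.
FORM_KEYWORDS = {
    "formA": ["application for leave"],
    "formB": ["expense report"],
}


def classify_text(text, keywords=FORM_KEYWORDS, default="unknown"):
    """Return the form type whose marker phrases occur most often in text."""
    lowered = text.lower()
    best_type, best_hits = default, 0
    for form_type, phrases in keywords.items():
        hits = sum(lowered.count(phrase.lower()) for phrase in phrases)
        if hits > best_hits:
            best_type, best_hits = form_type, hits
    return best_type


def sort_file(path, text, out_dir):
    """Move the scan at path into out_dir/<form type>/ based on its OCR text."""
    target_dir = Path(out_dir) / classify_text(text)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(path).name
    shutil.move(str(path), str(destination))
    return destination
```

Files that match no phrase end up in an `unknown` folder, so you can inspect them and add phrases as you go.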

answered Nov 14 '22 13:11

Aditya Mukherji

I don't know what format your scanned documents are in, but pdfminer can do layout analysis for PDFs. I guess it would fit the bill for your purpose, provided the documents are in a reasonably decent PDF format (if you've just got "pure images", it won't do you any good).
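
As a rough sketch of how that could work, assuming the maintained pdfminer.six fork and PDFs that actually contain a text layer: take the bounding boxes of the text blocks on the first page, snap them to a coarse grid, and compare that "outline" against one reference PDF per form type. The helper names, the grid size and the use of Jaccard similarity are my own choices for illustration, not something pdfminer provides.

```python
def normalize_boxes(boxes, width, height, grid=20):
    """Map (x0, y0, x1, y1) boxes to the set of grid cells they cover."""
    cells = set()
    for x0, y0, x1, y1 in boxes:
        # Clamp to the grid so boxes touching the page edge stay in range.
        c0 = max(0, int(x0 / width * grid))
        c1 = min(grid - 1, int(x1 / width * grid))
        r0 = max(0, int(y0 / height * grid))
        r1 = min(grid - 1, int(y1 / height * grid))
        for row in range(r0, r1 + 1):
            for col in range(c0, c1 + 1):
                cells.add((row, col))
    return frozenset(cells)


def jaccard(a, b):
    """Overlap of two cell sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def closest_form(signature, references):
    """Return the name of the reference signature most similar to signature."""
    return max(references, key=lambda name: jaccard(signature, references[name]))


def layout_signature(pdf_path, page_number=0, grid=20):
    """Grid signature of the text blocks on one page of a PDF."""
    # Imported lazily so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page in extract_pages(pdf_path, page_numbers=[page_number]):
        boxes = [el.bbox for el in page if isinstance(el, LTTextContainer)]
        return normalize_boxes(boxes, page.width, page.height, grid)
    return frozenset()
```

Usage would be something like building `references = {"formA": layout_signature("formA_sample.pdf"), ...}` from one known-good sample per form (hypothetical file names) and then calling `closest_form(layout_signature(path), references)` for each file. For image-only scans the signature comes back empty, which is the limitation mentioned above.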

answered Nov 14 '22 14:11

Steven
