Extract text from PDF

Tags: python, pdf

I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, it loses all formatting, and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.?

Thanks.

366

asked Jun 30 '10 11:06

Mridang Agarwalla

1 Answer

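A PDF does not contain tabular data unless it includes structured content. Without that, the file only stores text fragments placed at positions on the page, so there are no rows or columns to read back. Some tools include heuristics that try to guess the data structure and put it back together. I wrote a blog article explaining the issues with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/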
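To illustrate the kind of heuristic mentioned above, here is a minimal Python sketch. It assumes you have already obtained text fragments with page coordinates from whatever PDF library you use; the `(x, y, text)` tuple format, the function name `rebuild_rows`, and the `y_tolerance` value are hypothetical choices for this example, not part of any particular tool. It groups fragments whose y coordinates are close together into one row, then orders each row left to right.

```python
def rebuild_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of text.

    Assumes PDF-style coordinates: y grows upward, so the top of the
    page has the largest y. Fragments whose y differs from the first
    fragment of the current row by at most y_tolerance share that row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # Visit fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up sample data standing in for fragments read from a PDF page.
sample = [
    (200.0, 700.2, "Price"),
    (50.0, 700.0, "Item"),
    (50.0, 685.0, "Widget"),
    (200.0, 684.6, "9.99"),
]

table = rebuild_rows(sample)
# Write each row as one tab-separated line of plain text.
text_output = "\n".join("\t".join(row) for row in table)
print(text_output)
```

This only handles simple, well-aligned tables. Wrapped cells, merged columns, or rotated text would need further heuristics, which is why results differ between tools.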

102

answered Oct 06 '22 01:10

mark stephens
