Return text string from physical coordinates in a PDF with Python

I have been battling with Google and the limited documentation of PDFMiner for the last several hours, and although I feel close, I'm just not getting what I need. I've worked through http://www.unixuser.org/~euske/python/pdfminer/ and all three of the YouTube videos to gain a better understanding of PDFs, and I'm able to output raw text just fine.

I am working on a script to parse multiple PDF pages. Unfortunately, for this project I am dealing with poor-quality PDF files, and the only reliable constant I see is that the physical location of each text string is always exactly the same. Although I've read hints that text strings can be extracted by physical coordinates, I have yet to see a working example.

Is there anyone out there who could shed some light on how this is done with PDFMiner? I am open to other modules if there is an obviously better choice, but I need to stick with Python for the script.

Additionally, I have tried PyPdf with no success either (other than basic text output).

Thanks!

asked Feb 18 '12 18:02

user1145643

2 Answers

I've been writing a library to try to simplify this process, pdfquery. To extract text from a particular place in a particular page, you would do:

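import pdfquery

# `file` is the path to the PDF; `filename` (below) is where the XML dump is written
pdf = pdfquery.PDFQuery(file)
# load the first, third and fourth pages (page numbers are zero-based)
pdf.load(0, 2, 3)
# find text lying entirely inside the box from (100, 100) to (300, 300),
# measured in points from the bottom-left corner of the first page
# (the attribute value is quoted because some CSS selector parsers reject a bare number)
text = pdf.pq('LTPage[page_index="0"] :in_bbox("100,100,300,300")').text()
# save tree as XML to try to figure out why the last line didn't work the way you expected :)
pdf.tree.write(filename, pretty_print=True)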

If you want to find individual characters within that box, instead of text lines entirely within that box, pass merge_tags=None to PDFQuery (by default it merges consecutive characters into a single element to make the tree less ridiculous, so the whole line would have to be inside the box). If you want to find anything that partially overlaps the box, use :overlaps_bbox instead of :in_bbox.

This is basically using PyQuery selector syntax to grab text from a PDFMiner layout, so if your document is too messy for PDFMiner, it may be too messy for this as well, but at least it will be faster to play with.
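To make the difference between :in_bbox and :overlaps_bbox concrete, here is a minimal pure-Python sketch of the two tests on (x0, y0, x1, y1) boxes measured in points from the bottom-left corner. The helper names and sample boxes are made up for illustration, and this is not pdfquery's own implementation; in particular, how pdfquery treats boxes that only touch at an edge is not something this sketch claims to match.

```python
def in_bbox(inner, outer):
    """Return True if box `inner` lies entirely inside box `outer`."""
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    return ix0 >= ox0 and iy0 >= oy0 and ix1 <= ox1 and iy1 <= oy1


def overlaps_bbox(a, b):
    """Return True if boxes `a` and `b` share any area (or touch)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and ax1 >= bx0 and ay0 <= by1 and ay1 >= by0


# Hypothetical boxes: the query region and three text lines.
query = (100, 100, 300, 300)
line_inside = (120, 150, 280, 162)      # fully inside the query box
line_straddling = (250, 150, 400, 162)  # starts inside, runs past the right edge
line_outside = (320, 400, 500, 412)     # nowhere near the query box

for name, box in [("inside", line_inside),
                  ("straddling", line_straddling),
                  ("outside", line_outside)]:
    print(name, in_bbox(box, query), overlaps_bbox(box, query))
```

A line that straddles the edge of the region fails the containment test but passes the overlap test, which is why a merged text line that pokes out of the box only shows up with the overlap variant.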

answered Nov 10 '22 06:11

Jack Cushman

I was able to find my way around pdfminer thanks to some code by Denis Papathanasiou. The code is discussed in his blog, and you can find the source here: layout_scanner.py

In particular, take a look at the function parse_lt_objs(). In its final loop, k should be a pair containing the coordinates of that bit of text, which the code then discards. I don't have a working coordinate extractor to post here (I was not interested in the coordinates), but it sounds like you'll have no trouble finding your way from there.
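For readers who want a starting point, the sketch below shows one way to keep the coordinates instead of discarding them. It is not the layout_scanner.py code: it assumes the maintained pdfminer.six fork, whose extract_pages() helper yields one layout per page, and it assumes each text element exposes a bbox of (x0, y0, x1, y1) in points from the bottom-left corner. The function names and the "report.pdf" path are hypothetical.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text container on each page."""
    # Imported here so the filtering helper below can be used and tested
    # without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path)):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text()


def text_in_region(boxes, page_number, region):
    """Join the text of every box on `page_number` that lies entirely inside `region`."""
    x0, y0, x1, y1 = region
    hits = []
    for page, (bx0, by0, bx1, by1), text in boxes:
        inside = bx0 >= x0 and by0 >= y0 and bx1 <= x1 and by1 <= y1
        if page == page_number and inside:
            hits.append(text.strip())
    return "\n".join(hits)


# Intended use (needs pdfminer.six and a real file):
# print(text_in_region(iter_text_boxes("report.pdf"), 0, (100, 100, 300, 300)))
```

Keeping the region filter separate from the PDF parsing means the same region can be reused across many files, which suits documents where only the physical position of the text is reliable.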

Good luck with it!

answered Nov 10 '22 06:11

alexis
