 

Converting a PDF to text/HTML in Python so I can parse it

I have the following sample code where I download a PDF from the European Parliament website for a given legislative proposal:

EDIT: I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):

import mechanize
import urllib2
import re
from BeautifulSoup import BeautifulSoup

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"

url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    links = soup2.findAll("a", "com_acronym")
    report_links = [a["href"] for a in links if "REPORT" in a["href"]]
    if not report_links:  # an empty list is falsy; comparing it to None never triggers
        print "No A number"
    else:
        for href in report_links:
            page = br.open(str(href)).read()
            bs = BeautifulSoup(page)
            pdf_link = None
            for anchor in bs.findAll("a"):  # don't reuse the outer loop variable
                if re.search("PDF", str(anchor)) is not None:
                    pdf_link = "http://www.europarl.europa.eu/" + anchor["href"]
            if pdf_link is None:  # no PDF link found on this report page
                continue
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y, p)  # y and p come from the loops below
            localfile = open(name_pdf, "wb")  # binary mode, since PDF is binary data
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name="convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br.form["convertTo"] = ["html"]
            br.form["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()

            soup = BeautifulSoup(pdf_html)


page = range(1,2) #can be set to 400 to get every document for a given year
year = range(1999,2000) #can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name = "byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y,p)
        else:
            print "%s %s  Writing dossier..." % (y,p)
            link = None
            for i in br.links(url_regex="file.jsp"):
                link = i  # keep the last matching link
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)

In the get_pdf() function I would like to convert the PDF file to text in Python so I can parse it for information about the legislative procedure. Can anyone explain how this can be done?

Thomas

asked Sep 03 '10 by Thomas Jensen



2 Answers

Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.

To use it, once you had the file saved to disk you would run pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().

If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:

balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
answered Oct 04 '22 by Jack Cushman


It's not exactly magic. I suggest

  • downloading the PDF file to a temp directory,
  • calling out to an external program to extract the text into a (temp) text file,
  • reading the text file.

For the text extraction step there are a number of command-line utilities to choose from (and there may be other, perhaps Java-based, options as well). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
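The three steps above can be sketched roughly like this. I'm assuming pdftotext (shipped with Xpdf/Poppler) as the extractor here, and the helper names are my own; substitute whatever utility you settle on:

```python
import os
import subprocess
import tempfile

def pdftotext_cmd(pdf_path, txt_path):
    # Build the command line for pdftotext; -layout preserves the
    # original column layout, which helps when parsing tabular text.
    return ["pdftotext", "-layout", pdf_path, txt_path]

def extract_text(pdf_path):
    # Step 2 and 3: call out to the external program to write a temp
    # text file, then read that file back into Python.
    txt_path = os.path.join(tempfile.gettempdir(), "extracted.txt")
    subprocess.call(pdftotext_cmd(pdf_path, txt_path))
    with open(txt_path) as f:
        return f.read()
```

subprocess.call blocks until the extractor exits and returns its exit code, so you could check that it is 0 before trusting the output file.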

answered Oct 04 '22 by loevborg