
Extracting data from HTML-files with BeautifulSoup and Python

I need to extract data from HTML-files. The files in question are, most likely, automatically generated. I have uploaded the code of one of these files to Pastebin: http://pastebin.com/9Nj2Edfv. This is the link to the actual page: http://eur-lex.europa.eu/Notice.do?checktexts=checkbox&val=60504%3Acs&pos=1&page=1&lang=en&pgs=10&nbl=1&list=60504%3Acs%2C&hwords=&action=GO&visu=%23texte

The data I need to extract is found under the different headings.

This is what I have so far:

from BeautifulSoup import BeautifulSoup
ecj_data = open(r"data\ecj_1.html", 'r').read()

soup = BeautifulSoup(ecj_data)

celex = soup.find('h1')
auth_lang = soup('ul', limit=14)[13].li
procedure = soup('ul', limit=20)[17].li

print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure

I have all the data stored locally which is the reason it opens the file ecj_1.html.

The Celex number and the authentic language work reasonably well.

celex returns

"Celex number: 
61977J0059"

auth_lang returns "Authentic language: <li>French</li>"

I need just the contents of the h1 tag (not the break at the end).

[Also, I need auth_lang to return just "French", and not the <li>-tags.] This is not a problem anymore. I realized I could just add ".text" to the end of "auth_lang".

Procedure on the other hand returns this:

    Type of procedure: <li>
    <strong>Type of procedure:</strong>
    <br />
    Reference for a preliminary ruling
    </li>

which is quite wrong as I just need it to return "Reference for a preliminary ruling".

Is there any way I can achieve this?

Second edit: I replaced celex = soup.find('h1') with celex = soup('h1', limit=2)[0] and added .text to celex in the print statement.

asked Mar 20 '12 12:03 by A2D2


1 Answer

Each tag you found exposes its children as a list through .contents. For the first two, that list has length 1. The list for procedure, however, has 5 elements, and the entry you are after (in this case) is the last one, at index 4. I've also used splitlines() to get rid of the newlines.

print "Celex number:", celex.contents[0].splitlines()[1]
print "Authentic language:", auth_lang.contents[0].splitlines()[0]
print "Type of procedure:", procedure.contents[4].splitlines()[1]

output:

Celex number: 61977J0059
Authentic language: French
Type of procedure: Reference for a preliminary ruling
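
For readers on Python 3, the same idea can be sketched with the bs4 package (BeautifulSoup 4) instead of the BeautifulSoup 3 import used in the question. This is only an illustration under assumptions: SAMPLE is a hypothetical fragment modeled on the markup quoted in the question, not the real page, and last_text is a helper name invented here. Rather than hard-coding index 4, it takes the last non-empty string inside each tag.

```python
# Assumes Python 3 and the bs4 package (BeautifulSoup 4).
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the markup quoted in the question.
SAMPLE = """<h1>
61977J0059<br /></h1>
<ul><li>French</li></ul>
<ul><li>
<strong>Type of procedure:</strong>
<br />
Reference for a preliminary ruling
</li></ul>"""

soup = BeautifulSoup(SAMPLE, "html.parser")


def last_text(tag):
    """Return the last non-empty text fragment inside tag, whitespace stripped."""
    return list(tag.stripped_strings)[-1]


# get_text(strip=True) drops the newline that precedes the Celex number.
celex = soup.find("h1").get_text(strip=True)

# The sample contains exactly two <li> tags: language, then procedure.
auth_lang, procedure = (last_text(li) for li in soup.find_all("li"))

print("Celex number:", celex)
print("Authentic language:", auth_lang)
print("Type of procedure:", procedure)
```

Taking the last stripped string avoids depending on how many whitespace-only children the parser places before the value, which is what made the fixed index necessary above. On the real page, the tags would still have to be located first, for example with the positional lookups from the question.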
answered Oct 24 '22 13:10 by fraxel