I need to extract data from HTML files. The files in question are, most likely, automatically generated. I have uploaded the source of one of these files to Pastebin: http://pastebin.com/9Nj2Edfv. This is the link to the actual page: http://eur-lex.europa.eu/Notice.do?checktexts=checkbox&val=60504%3Acs&pos=1&page=1&lang=en&pgs=10&nbl=1&list=60504%3Acs%2C&hwords=&action=GO&visu=%23texte
The data I need to extract is found under the different headings.
This is what I have so far:
from BeautifulSoup import BeautifulSoup
ecj_data = open(r"data\ecj_1.html", 'r').read()  # raw string so the backslash is never read as an escape
soup = BeautifulSoup(ecj_data)
celex = soup.find('h1')
auth_lang = soup('ul', limit=14)[13].li
procedure = soup('ul', limit=20)[17].li
print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure
All the data is stored locally, which is why the script opens the file ecj_1.html.
The Celex number and the authentic language work reasonably well.
celex returns
"Celex number:
61977J0059"
auth_lang returns "Authentic language: <li>French</li>"
I need just the contents of the h1 tag (not the break at the end).
[Also, I need auth_lang to return just "French", without the <li> tags.]
This is not a problem anymore. I realized I could just add ".text" to the end of "auth_lang".
Procedure on the other hand returns this:
Type of procedure: <li>
<strong>Type of procedure:</strong>
<br />
Reference for a preliminary ruling
</li>
which is quite wrong as I just need it to return "Reference for a preliminary ruling".
Is there any way I can achieve this?
Second edit:
I replaced celex = soup.find('h1') with celex = soup('h1', limit=2)[0] and added .text to celex in the print statement.
The contents attribute of each element you found is a list; for the first two it has length 1. For procedure, however, it is 5 elements long, and the entry you are after (in this case) is the one at index 4, the last one. I've also used splitlines() to get rid of the newlines.
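To make the indexing concrete, here is a small stand-in that needs no parser at all. It assumes the contents list of procedure looks like the <li> printed in the question; the two tags are written as plain strings purely for illustration, whereas BeautifulSoup would hold Tag objects in those positions.

```python
# Stand-in for procedure.contents (an assumption based on the <li> shown in
# the question). Positions 1 and 3 would be Tag objects in BeautifulSoup.
contents = [
    "\n",
    "<strong>Type of procedure:</strong>",
    "\n",
    "<br />",
    "\nReference for a preliminary ruling\n",
]

last = contents[4]          # the text node that follows <br />
lines = last.splitlines()   # the leading newline produces an empty first entry
value = lines[1]            # so the wanted text sits at index 1

stripped = last.strip()     # strip() gives the same text without indexing

print(len(contents))
print(lines)
print(value)
```

The empty first entry in lines is why the second index is 1 rather than 0 for this field. With the real parsed objects, the same indexing reads: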
print "Celex number:", celex.contents[0].splitlines()[1]
print "Authentic language:", auth_lang.contents[0].splitlines()[0]
print "Type of procedure:", procedure.contents[4].splitlines()[1]
output:
Celex number: 61977J0059
Authentic language: French
Type of procedure: Reference for a preliminary ruling
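The positional indices above depend on the exact layout of each page, so they may break if a list item gains or loses an element. One possible alternative is to pick the value by its label instead. The sketch below does that with Python 3's standard-library html.parser rather than BeautifulSoup, so it is a different technique from the one used above. The class name LabelledValueParser and the snippet variable are hypothetical, and the snippet is a trimmed reconstruction of the output shown in the question, not the real page.

```python
from html.parser import HTMLParser


class LabelledValueParser(HTMLParser):
    """Collect the text that follows a <strong>label</strong> inside an <li>."""

    def __init__(self):
        super().__init__()
        self.values = {}         # label -> value text
        self._in_strong = False  # True while reading the label itself
        self._label = None       # label of the <li> currently being read
        self._buffer = []        # text fragments gathered so far

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "strong":
            # The label is complete; drop the trailing colon.
            self._in_strong = False
            self._label = "".join(self._buffer).strip().rstrip(":")
            self._buffer = []
        elif tag == "li" and self._label is not None:
            # End of a labelled item: collapse whitespace and store the value.
            self.values[self._label] = " ".join("".join(self._buffer).split())
            self._label = None
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a label or after one within the same <li>.
        if self._in_strong or self._label is not None:
            self._buffer.append(data)


# Hypothetical sample modelled on the markup printed in the question.
snippet = """<ul>
<li>French</li>
<li>
<strong>Type of procedure:</strong>
<br />
Reference for a preliminary ruling
</li>
</ul>"""

parser = LabelledValueParser()
parser.feed(snippet)
parser.close()
print(parser.values["Type of procedure"])
```

Items without a <strong> label, such as the authentic language entry, are skipped by this sketch and would still need the approach shown above.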