
Read a value from a web page using Python

Tags:

python

I am trying to read a value from an HTML page into a variable in a Python script. I have already figured out how to download the page to a local file using urllib, and I could extract the value with a bash script, but I would like to do it in Python.

import urllib
urllib.urlretrieve('http://url.com', 'page.htm')

The page has this in it:

<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>

I need the 17.4 value from the Wired: line.

Any suggestions?

Thanks

user2845506 asked Oct 17 '25 23:10


2 Answers

Start by not using urlretrieve(); you want the data, not a file.

Next, use an HTML parser. BeautifulSoup is great for extracting text from HTML.

Retrieving the page with urllib2 (Python 2) would be:

from urllib2 import urlopen

response = urlopen('http://url.com/')

then read the data into BeautifulSoup:

from bs4 import BeautifulSoup

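# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(response.read(), 'html.parser', from_encoding=response.headers.getparam('charset'))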

The from_encoding argument tells BeautifulSoup which encoding the web server declared for the page; if the server did not specify one, BeautifulSoup makes an educated guess for you.
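The snippets in this answer target Python 2. On Python 3, urllib2 was merged into urllib.request and headers.getparam() is no longer available, so the fetch step might look roughly like the sketch below. The names fetch_text and fallback_encoding are made up for illustration, and falling back to UTF-8 when the server declares no charset is an assumption, not something the answer prescribes.

```python
from urllib.request import urlopen


def fetch_text(url, fallback_encoding="utf-8"):
    """Download a page and return its body decoded to str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
    # Assumption: use UTF-8 when the server did not declare an encoding.
    return raw.decode(charset or fallback_encoding, errors="replace")
```

Since the result is already a str, it can be handed to BeautifulSoup without from_encoding, for example BeautifulSoup(fetch_text('http://url.com/'), 'html.parser').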

Now you can search for your data:

for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
    if 'Wired:' in line:
        value = float(line.partition('Wired:')[2])
        print value

For your demo HTML snippet that gives:

>>> for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
...     if 'Wired:' in line:
...         value = float(line.partition('Wired:')[2])
...         print value
... 
17.4
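The sample page also contains readings that float() cannot parse, such as "15.8-" and "Error". If those lines ever need to be read too, the same partition idea could be wrapped in a small helper along these lines. This is only a sketch: parse_reading is a hypothetical name, the list below is typed in by hand to mimic what stripped_strings yields for the sample div, and returning None for unparseable values is an assumption.

```python
def parse_reading(line):
    """Split 'label: value' into (label, float), or (label, None) if not numeric."""
    label, sep, raw = line.partition(":")
    if not sep:
        # No colon at all, so there is no value to parse.
        return line.strip(), None
    try:
        return label.strip(), float(raw)
    except ValueError:
        # Covers values such as 'Error' or '15.8-' in the sample page.
        return label.strip(), None


# Hand-written stand-in for the strings stripped_strings yields on the sample div.
lines = [
    "Plateau (19:01)",
    "Wired: 17.4",
    "P10 Chard: 16.7",
    "P1 P. Gris: 17.1",
    "P20 Pinot Noir: 15.8-",
    "Soil Temp : Error",
    "Rainfall: 0.2",
]

readings = dict(parse_reading(line) for line in lines)
print(readings["Wired"])  # 17.4
```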
Martijn Pieters answered Oct 19 '25 14:10


This is called web scraping, and there is a very popular Python library for it called Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you'd like to do it with urllib/urllib2, you can accomplish that using regular expressions:

http://docs.python.org/2/library/re.html

With a regex, you use the text surrounding your desired value as the key, then strip the key away. So in this case you might match from "Wired: " to the next newline character, then strip away the "Wired: " prefix and the newline character.
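A minimal sketch of the regex approach, using only the standard library. Instead of matching the whole line and stripping the key afterwards, it uses a capture group to pull out just the number, which is a slight variation on the description above. The names WIRED_RE and extract_wired are illustrative, and the HTML is the sample from the question pasted in as a string.

```python
import re

html = """<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>"""

# "Wired:" is the key; the parenthesised group captures the number that follows it.
WIRED_RE = re.compile(r"Wired:\s*([-+]?\d+(?:\.\d+)?)")


def extract_wired(page_text):
    """Return the Wired reading as a float, or None if it is not found."""
    match = WIRED_RE.search(page_text)
    return float(match.group(1)) if match else None


print(extract_wired(html))  # 17.4
```

This works for a fixed, simple layout like the one shown; if the page structure changes, an HTML parser is likely to be more robust.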

Adelmar answered Oct 19 '25 12:10


