
Reading some content from a web page in Python

Tags:

python

I am trying to read some data from a web page using a Python module.

I can fetch the page, but I am having difficulty parsing the data and extracting the information I need.

My code is below. Any help is appreciated.

#!/usr/bin/python2.7 -tt

import urllib
import urllib2

def Connect2Web():
  aResp = urllib2.urlopen("https://uniservices1.uobgroup.com/secure/online_rates/gold_and_silver_prices.jsp");
  web_pg = aResp.read();

  print web_pg

# Define a main() function that fetches and prints the page.
def main():
  Connect2Web()

# This is the standard boilerplate that calls the main function.
if __name__ == '__main__':
    main()

When I run this, the whole web page is printed.

I want to extract specific information from it (e.g. find "SILVER PASSBOOK ACCOUNT" and get its rate), but I am having difficulty parsing the HTML document.

tush1r asked Dec 09 '22 01:12



2 Answers

Using regular expressions to match XML/HTML is not recommended, although it can sometimes work. It's better to use an HTML parser and a DOM API. Here's an example:

import html5lib
import urllib2

aResp = urllib2.urlopen("https://uniservices1.uobgroup.com/secure/online_rates/gold_and_silver_prices.jsp")
t = aResp.read()
dom = html5lib.parse(t, treebuilder="dom")
trlist = dom.getElementsByTagName("tr")
print trlist[-3].childNodes[1].firstChild.childNodes[0].nodeValue

You could iterate over trlist to find the data you are interested in.
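As a rough sketch of that iteration: the snippet below walks every tr element and returns the cells of the row whose first cell matches a label. It is written for Python 3 and uses the standard library's xml.dom.minidom on a small, made-up, well-formed sample so that it runs without a network connection or third-party modules; with treebuilder="dom", html5lib should hand back the same kind of minidom tree, so the loop ought to carry over. The sample markup, the placeholder values, and the helper names cell_text and find_row are assumptions for illustration, not the real page's structure.

```python
from xml.dom import minidom

# Made-up, well-formed sample; the real page's markup and values may differ.
SAMPLE = """<table>
<tr><td><b>GOLD SAVINGS ACCOUNT</b></td><td>SGD</td><td>1 GM</td><td>1.00</td><td>2.00</td></tr>
<tr><td><b>SILVER PASSBOOK ACCOUNT</b></td><td>SGD</td><td>1 OZ</td><td>3.00</td><td>4.00</td></tr>
</table>"""


def cell_text(node):
    """Concatenate all text found below a DOM node (hypothetical helper)."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(cell_text(child))
    return "".join(parts).strip()


def find_row(dom, label):
    """Return the cell texts of the first row whose first cell equals label."""
    for tr in dom.getElementsByTagName("tr"):
        cells = [cell_text(td) for td in tr.getElementsByTagName("td")]
        if cells and cells[0] == label:
            return cells
    return None


dom = minidom.parseString(SAMPLE)
row = find_row(dom, "SILVER PASSBOOK ACCOUNT")
print(row)
```

Matching on the label rather than on a fixed index such as trlist[-3] should keep working if rows are added to or removed from the table.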

Added from comment: html5lib is a third-party module; see the html5lib site. It can be installed with easy_install or pip.

Keith answered Dec 29 '22 12:12



It's possible to use regular expressions to get the required data:

import urllib
import urllib2
import re

def Connect2Web():
  aResp = urllib2.urlopen("https://uniservices1.uobgroup.com/secure/online_rates/gold_and_silver_prices.jsp");
  web_pg = aResp.read();

  # Non-greedy groups, so each one stops at its own closing </td>.
  pattern = "<td><b>SILVER PASSBOOK ACCOUNT</b></td>" + "<td>(.*?)</td>" * 4
  m = re.search(pattern, web_pg)
  if m:
    print "SILVER PASSBOOK ACCOUNT:"
    print "\tCurrency:", m.group(1)
    print "\tUnit:", m.group(2)
    print "\tBank Sells:", m.group(3)
    print "\tBank Buys:", m.group(4)
  else:
    print "Nothing found"

If you are matching in a loop, remember to compile the pattern once with re.compile.
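A minimal sketch of the compile-once idea, written for Python 3: the pattern is compiled a single time at module level and reused for every row via finditer. The sample HTML, its placeholder values, and the names ROW_RE and parse_rates are made up for illustration; the real page's markup may not match this pattern.

```python
import re

# Made-up sample rows; the real page's markup and values may differ.
SAMPLE = (
    "<tr><td><b>GOLD SAVINGS ACCOUNT</b></td>"
    "<td>SGD</td><td>1 GM</td><td>1.00</td><td>2.00</td></tr>\n"
    "<tr><td><b>SILVER PASSBOOK ACCOUNT</b></td>"
    "<td>SGD</td><td>1 OZ</td><td>3.00</td><td>4.00</td></tr>\n"
)

# Compiled once and reused for every match. Group 1 is the account name,
# groups 2-5 are the four cells that follow it.
ROW_RE = re.compile(r"<td><b>(?P<name>[^<]+)</b></td>" + r"<td>(.*?)</td>" * 4)


def parse_rates(html):
    """Map each account name to its currency, unit, and the two rates."""
    rates = {}
    for m in ROW_RE.finditer(html):
        currency, unit, sells, buys = m.group(2, 3, 4, 5)
        rates[m.group("name")] = {
            "currency": currency,
            "unit": unit,
            "bank_sells": sells,
            "bank_buys": buys,
        }
    return rates


rates = parse_rates(SAMPLE)
print(rates["SILVER PASSBOOK ACCOUNT"])
```

The re module also keeps an internal cache of recently used patterns, so the gain from compiling explicitly is usually small; the main benefit is a named, reusable pattern object.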

max taldykin answered Dec 29 '22 11:12
