I am trying to get a value out of an HTML page using the Python HTMLParser library. The value I want to get hold of is within this HTML element:
...
<div id="remository">20</div>
...
This is my HTMLParser class so far:
    import HTMLParser
    import urllib

    class LinksParser(HTMLParser.HTMLParser):
        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            self.seen = {}

        def handle_starttag(self, tag, attributes):
            if tag != 'div': return
            for name, value in attributes:
                if name == 'id' and value == 'remository':
                    #print value
                    return

        def handle_data(self, data):
            print data

    p = LinksParser()
    f = urllib.urlopen("http://example.com/somepage.html")
    html = f.read()
    p.feed(html)
    p.close()
I want the class to extract the value 20.
The HTMLParser class defined in this module provides functionality to parse HTML and XHTML documents. It contains handler methods that are called for tags, data, comments and other HTML elements. To use it, define a new class that inherits from HTMLParser, override the handlers you need, and submit the HTML text with the feed() method.
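As a rough illustration of the subclass-and-feed() pattern described above, here is a minimal sketch. It assumes Python 3, where the class lives in html.parser (the code in this thread is Python 2 and imports it from the HTMLParser module). The class name EventCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser  # Python 3 location; Python 2 used the HTMLParser module


class EventCollector(HTMLParser):
    """Hypothetical subclass that records every handler call as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


collector = EventCollector()
collector.feed('<div id="remository">20</div><!-- note -->')
collector.close()

for event in collector.events:
    print(event)
```

Each handler only fires for the kind of content it is named after, so the text "20" shows up in handle_data, not in handle_starttag.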
Reading the HTML file: in a typical example, we request a URL to load the page into the Python environment, read the entire HTML document with an HTML parser, and print the first few lines of the page.
    class LinksParser(HTMLParser.HTMLParser):
        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            self.recording = 0
            self.data = []

        def handle_starttag(self, tag, attributes):
            if tag != 'div':
                return
            if self.recording:
                self.recording += 1
                return
            for name, value in attributes:
                if name == 'id' and value == 'remository':
                    break
            else:
                return
            self.recording = 1

        def handle_endtag(self, tag):
            if tag == 'div' and self.recording:
                self.recording -= 1

        def handle_data(self, data):
            if self.recording:
                self.data.append(data)
self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.
The data at the end of the parse are left in self.data (a list of strings, possibly empty if no triggering tag was met). Your code from outside the class can access the list directly from the instance at the end of the parse, or you can add appropriate accessor methods for the purpose, depending on what exactly is your goal.
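To see the counter at work and how the list is read from outside the class, here is a sketch of the same parser ported to Python 3 (an assumption on my part; the code above is Python 2). The sample HTML, including the nested div and the surrounding elements, is invented for the demonstration.

```python
from html.parser import HTMLParser  # Python 3 port; the thread's code uses Python 2


class LinksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recording = 0   # depth of div nesting inside the triggering div
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            # already inside the triggering div: just go one level deeper
            self.recording += 1
            return
        for name, value in attributes:
            if name == 'id' and value == 'remository':
                break
        else:
            return   # a div, but not the one we are looking for
        self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Made-up sample page with a nested div inside the target element
sample = ('<p>skip</p>'
          '<div id="remository">20<div>nested</div></div>'
          '<div>after</div>')

p = LinksParser()
p.feed(sample)
p.close()

print(p.data)        # the list is read straight from the instance
print(p.data[0])     # first text chunk inside the triggering div
```

Text inside the nested div is still collected because the counter only returns to zero when the triggering div itself closes.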
The class could be easily made a bit more general by using, in lieu of the constant literal strings seen in the code above, 'div', 'id', and 'remository', instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it. I avoided that cheap generalization step in the code above to avoid obscuring the core points (keep track of a count of nested tags and accumulate data into a list when the recording state is active).
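One way that generalization might look is sketched below, again assuming Python 3's html.parser. The attribute names self.tag, self.attname and self.attvalue follow the description above; the class name TagTextParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser  # Python 3 assumed


class TagTextParser(HTMLParser):
    """Collect text inside the first-level element matching tag/attname/attvalue."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0   # nesting depth of self.tag inside a triggering element
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            self.recording += 1
            return
        for name, value in attributes:
            if name == self.attname and value == self.attvalue:
                break
        else:
            return
        self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Made-up markup for the demonstration
sample = '<div id="remository">20</div><span class="price">9.99</span>'

div_parser = TagTextParser('div', 'id', 'remository')
div_parser.feed(sample)
div_parser.close()

span_parser = TagTextParser('span', 'class', 'price')
span_parser.feed(sample)
span_parser.close()

print(div_parser.data)
print(span_parser.data)
```

The logic is unchanged from the fixed-string version; only the three literals have been replaced by constructor arguments.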
Have you tried BeautifulSoup?
    from bs4 import BeautifulSoup
    soup = BeautifulSoup('<div id="remository">20</div>', 'html.parser')
    tag = soup.div
    print(tag.string)
This prints 20.
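A small correction at line 3:
    HTMLParser.HTMLParser.__init__(self)
should be
    HTMLParser.__init__(self)
if the class is imported with "from HTMLParser import HTMLParser", as in the code below. (With a plain "import HTMLParser", the original form is the correct one.)
The following worked for me, though: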
    import urllib2
    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.recording = 0
            self.data = []

        def handle_starttag(self, tag, attrs):
            if tag == 'required_tag':
                for name, value in attrs:
                    if name == 'somename' and value == 'somevalue':
                        print name, value
                        print "Encountered the beginning of a %s tag" % tag
                        self.recording = 1

        def handle_endtag(self, tag):
            # Only decrement while recording, so the counter never goes negative
            if tag == 'required_tag' and self.recording:
                self.recording -= 1
                print "Encountered the end of a %s tag" % tag

        def handle_data(self, data):
            if self.recording:
                self.data.append(data)

    p = MyHTMLParser()
    f = urllib2.urlopen('http://www.example.com')
    html = f.read()
    p.feed(html)
    print p.data
    p.close()