I'm currently working with Python's HTMLParser, and I'm trying to get the HTML subtree contained in a specific node. I have a generic parser that does its job well, and once it finds the interesting tag, I would like to feed a second, specific HTMLParser with the markup inside that node.
Here is an example of what I want to do:
    from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

    class genericParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.divFound = False

        def handle_starttag(self, tag, attrs):
            if tag == "div" and ("class", "good") in attrs:
                self.divFound = True

        def handle_data(self, data):
            if self.divFound:
                print(data)  # prints nothing useful, only whitespace
                parser = specificParser()
                parser.feed(data)
                self.divFound = False
I then feed the genericParser with something like:
    <html>
      <head></head>
      <body>
        <div class='good'>
          <ul>
            <li>test1</li>
            <li>test2</li>
          </ul>
        </div>
      </body>
    </html>
But the Python documentation for HTMLParser.handle_data says:
"This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>)."
In my genericParser, the data received in handle_data is empty (whitespace only) because the content of my <div class='good'> is markup, not a text node.
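To make this concrete, here is a minimal sketch that records which callbacks fire for a fragment like the one above. The `EventLogger` class is a hypothetical helper introduced only for illustration, and the import assumes Python 3. It suggests that the `<ul>` and `<li>` markup arrives through `handle_starttag` and `handle_endtag`, while `handle_data` only ever sees the text.

```python
from html.parser import HTMLParser  # Python 3; Python 2 used `from HTMLParser import HTMLParser`


class EventLogger(HTMLParser):
    """Hypothetical helper: records every callback HTMLParser fires."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the log stays readable.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed("<div class='good'><ul><li>test1</li><li>test2</li></ul></div>")
logger.close()
for event in logger.events:
    print(event)
```

No single callback receives the inner markup as a string, so it has to be reassembled from the start tag, data and end tag events.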
How can I retrieve the raw HTML inside my div using HTMLParser?
Thanks in advance
I've solved this problem by buffering everything encountered inside the interesting HTML node.
It works, but it isn't very "clean", because the genericParser has to parse the whole interesting block before feeding the specificParser with it. Here is a "light" solution (without any error handling):
    from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

    class genericParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.divFound = False
            self.buff = ""
            self.level = 0  # nesting depth of the tags seen inside the div

        def computeRecord(self, tag, attrs):
            # Rebuild a start tag from its parsed name and attributes.
            mystr = "<" + tag
            for att, val in attrs:
                mystr += " " + att + "='" + val + "'"
            mystr += ">"
            return mystr

        def handle_starttag(self, tag, attrs):
            if tag == "div" and ("class", "good") in attrs:
                self.divFound = True
            elif self.divFound:
                self.level += 1
                self.buff += self.computeRecord(tag, attrs)

        def handle_data(self, data):
            if self.divFound:
                self.buff += data

        def handle_endtag(self, tag):
            if self.divFound:
                self.buff += "</" + tag + ">"
                self.level -= 1
                # Depth 0 is reached when the first child element of the div closes.
                if self.level == 0:
                    self.divFound = False
                    print(self.buff)
The output is as desired:
    <ul>
      <li>test1</li>
      <li>test2</li>
    </ul>
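The version above stops buffering as soon as the first child element of the div closes, and it rebuilds start tags by hand. Below is a sketch of a variant that counts depth from the div itself and copies each start tag verbatim with `get_starttag_text()`. The names `SubtreeExtractor`, `on_subtree` and `VOID_TAGS` are hypothetical, and the code assumes Python 3.4 or later (for `convert_charrefs`). It is one possible arrangement, not the only way to do it.

```python
from html.parser import HTMLParser  # Python 3.4+ assumed

# Elements that never have an end tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SubtreeExtractor(HTMLParser):
    """Hypothetical variant: collects the inner HTML of <div class='good'>."""

    def __init__(self, on_subtree):
        # convert_charrefs=False keeps references such as &amp; verbatim.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.on_subtree = on_subtree  # called with the buffered markup
        self.depth = 0                # 0 means "outside the target div"
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if tag == "div" and ("class", "good") in attrs:
                self.depth = 1
            return
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: copy it, depth is unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            # The matching </div> was reached: hand over the inner HTML.
            self.on_subtree("".join(self.parts))
            self.parts = []
        else:
            self.parts.append("</" + tag + ">")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&" + name + ";")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#" + name + ";")


found = []
extractor = SubtreeExtractor(found.append)
extractor.feed("<body><div class='good'><p>a &amp; b</p>"
               "<ul><li>test1</li><li>test2<br></li></ul></div></body>")
extractor.close()
print(found[0])
```

Passing a callback keeps the extractor independent of whatever consumes the markup. For instance, the bound `feed` method of a second parser could be given as `on_subtree`. The whole block is still buffered before it is handed over, as in the solution above.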
As Birei said in the comments, it would have been easier to extract the subtree with BeautifulSoup:
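    from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used `from BeautifulSoup import BeautifulSoup`

    soup = BeautifulSoup(html, "html.parser")  # `html` holds the document shown above
    div = soup("div", {"class": "good"})
    children = div[0].findChildren()
    print(children[0])  # desired output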