Get HTML subtree from HTMLParser

Question

I'm currently working with Python's HTMLParser, and I'm trying to get the HTML subtree contained in a specific node. I have a generic parser that does its job well, and once it finds the interesting tag, I would like to feed another, specific HTMLParser with the data in this node.

This is an example of what I want to do:

class genericParser(HTMLParser):
   def __init__ (self):
       HTMLParser.__init__(self)
       self.divFound = False

   def handle_starttag (self, tag, attrs):
       if tag == "div" and ("class", "good") in attrs:
           self.divFound = True

   def handle_data (self, data):
       if self.divFound:
           print data    ## print nothing
           parser = specificParser ()
           parser.feed (data)
           self.divFound = False

and feed the genericParser with something like:

<html>
<head></head>
<body>
   <div class='good'>
      <ul>
         <li>test1</li>
         <li>test2</li>
      </ul>
   </div>
</body>
</html>

but the Python documentation of HTMLParser.handle_data says:

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).

In my genericParser, the data received in handle_data is empty because the content of my <div class='good'> isn't a text node.

How can I retrieve the raw HTML inside my div using HTMLParser?
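To make the behaviour above concrete, here is a minimal sketch that records every callback the parser fires. It assumes Python 3 (where the module is html.parser; in Python 2.7 it is HTMLParser), and EventLogger is a hypothetical name used only for illustration. It shows that handle_data only ever receives the text between tags, never the markup itself.

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<div class='good'><ul><li>test1</li></ul></div>")
logger.close()

for event in logger.events:
    print(event)
# ('start', 'div')
# ('start', 'ul')
# ('start', 'li')
# ('data', 'test1')
# ('end', 'li')
# ('end', 'ul')
# ('end', 'div')
```

With the indented sample page, the first data event after the div start tag would be the whitespace between <div class='good'> and <ul>, which is probably why the print shows nothing visible.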

Thanks in advance

Marcassin · Accepted Answer

I solved this problem by buffering all the data encountered inside the interesting HTML node.

It works, but it isn't very "clean", because the genericParser has to parse the whole interesting block before feeding it to the specificParser. Here is a "light" solution (without any error handling):

from HTMLParser import HTMLParser   # Python 2.7 module name

class genericParser(HTMLParser):
   def __init__ (self):
       HTMLParser.__init__ (self)
       self.divFound = False
       self.buff = ""
       self.level = 0      # nesting depth, counted from the interesting div

   def computeRecord (self, tag, attrs):
       # Rebuild a start tag from its name and its (name, value) pairs
       mystr = "<" + tag
       for att, val in attrs:
           mystr += " " + att + "='" + (val or "") + "'"
       mystr += ">"
       return mystr

   def handle_starttag (self, tag, attrs):
       if self.divFound:
           # Nested tag: go one level deeper and buffer it
           self.level += 1
           self.buff += self.computeRecord (tag, attrs)
       elif tag == "div" and ("class", "good") in attrs:
           self.divFound = True
           self.level = 1

   def handle_data (self, data):
       if self.divFound:
           self.buff += data

   def handle_endtag (self, tag):
       if self.divFound:
           self.level -= 1
           if self.level == 0:
               # End tag of the interesting div: the subtree is complete
               self.divFound = False
               print self.buff
           else:
               self.buff += "</" + tag + ">"

The output is as desired:

<ul>
     <li>test1</li>
     <li>test2</li>
</ul>
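For reference, the same buffering idea can be sketched in Python 3 (an assumption, since the code above targets Python 2.7). SubtreeGrabber, ItemCollector and VOID_TAGS are hypothetical names; ItemCollector stands in for the specificParser from the question. The sketch relies on HTMLParser.get_starttag_text() to copy start tags verbatim instead of rebuilding them, and it skips void tags such as <br>, which never get an end tag.

```python
from html.parser import HTMLParser

# Tags that never have an end tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SubtreeGrabber(HTMLParser):
    """Buffers the raw inner HTML of every <div class="good">."""

    def __init__(self):
        # Keep character references as written so the buffer stays raw.
        super().__init__(convert_charrefs=False)
        self.depth = 0        # 0 means "outside the interesting div"
        self.parts = []
        self.subtrees = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag == "div" and ("class", "good") in attrs:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: buffer it, depth unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            # End tag of the interesting div: the subtree is complete.
            self.subtrees.append("".join(self.parts))
            self.parts = []
        else:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#%s;" % name)


class ItemCollector(HTMLParser):
    """Stand-in for the specific parser: collects the text of each <li>."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


page = """<html><head></head><body>
   <div class='good'>
      <ul>
         <li>test1</li>
         <li>test2</li>
      </ul>
   </div>
</body></html>"""

grabber = SubtreeGrabber()
grabber.feed(page)
grabber.close()

collector = ItemCollector()
collector.feed(grabber.subtrees[0])   # hand the raw subtree to the second parser
collector.close()
print(collector.items)   # ['test1', 'test2']
```

Because the depth counter tracks every nested tag, a <div> nested inside the interesting one does not end the capture early.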

As Birei said in the comments, it would have been easier to extract the subtree with BeautifulSoup:

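from bs4 import BeautifulSoup   # assumes BeautifulSoup 4 is installed

soup = BeautifulSoup (html)     # html holds the page source as a string
div = soup("div", {"class" : "good"})
children = div[0].findChildren ()
print children[0]   #### desired output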
