
Get HTML subtree from HTMLParser

I'm currently working with HTMLParser for Python, and I'm trying to get the HTML subtree contained in a specific node. I have a generic parser that does its job well, and once the interesting tag is found, I would like to feed another, more specific HTMLParser with the data in this node.

This is an example of what I want to do:

from html.parser import HTMLParser  # Python 3; on Python 2: from HTMLParser import HTMLParser

class genericParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.divFound = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "good") in attrs:
            self.divFound = True

    def handle_data(self, data):
        if self.divFound:
            print(data)    ## prints nothing useful
            parser = specificParser()    # another HTMLParser subclass, defined elsewhere
            parser.feed(data)
            self.divFound = False

and I feed the genericParser with something like:

<html>
<head></head>
<body>
   <div class='good'>
      <ul>
         <li>test1</li>
         <li>test2</li>
      </ul>
   </div>
</body>
</html>

However, the Python documentation for HTMLParser.handle_data says:

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).

In my genericParser, the data received in handle_data contains nothing useful, because the content of my <div class='good'> isn't a text node.
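A small tracing parser makes the callback sequence visible. This is only an illustrative sketch for Python 3: the TraceParser class and its events list are hypothetical names, not part of the original code. It suggests that handle_data only ever receives the text between tags, while the markup itself arrives through handle_starttag and handle_endtag.

```python
from html.parser import HTMLParser


class TraceParser(HTMLParser):
    """Hypothetical helper: records every callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Only text nodes end up here, never tags.
        self.events.append(("data", data))


tracer = TraceParser()
tracer.feed("<div class='good'><ul><li>test1</li></ul></div>")
tracer.close()

for event in tracer.events:
    print(event)
```

The only "data" event in this run is the text test1; the <ul> and <li> tags are reported separately as start and end events.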

How can I retrieve the raw HTML inside my div using HTMLParser?

Thanks in advance

Marcassin asked Feb 18 '26 00:02



1 Answer

I've solved this problem by buffering all data encountered in the interesting HTML node.

This works but isn't very "clean", because the genericParser has to parse the whole interesting block before it can feed it to the specificParser. Here is a "light" solution (without any error handling):

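from html.parser import HTMLParser  # Python 3; on Python 2: from HTMLParser import HTMLParser

class genericParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.divFound = False
        self.buff = ""
        self.level = 0    # nesting depth, counted from the interesting div

    def computeRecord(self, tag, attrs):
        # Rebuild the start tag from its name and attributes
        mystr = "<" + tag
        for att, val in attrs:
            if val is None:    # attribute without a value, e.g. <input disabled>
                mystr += " " + att
            else:
                mystr += " " + att + "='" + val + "'"
        mystr += ">"
        return mystr

    def handle_starttag(self, tag, attrs):
        if self.divFound:
            # Any tag inside the interesting div is buffered
            self.level += 1
            self.buff += self.computeRecord(tag, attrs)
        elif tag == "div" and ("class", "good") in attrs:
            self.divFound = True
            self.level = 1

    def handle_data(self, data):
        if self.divFound:
            self.buff += data

    def handle_endtag(self, tag):
        if self.divFound:
            self.level -= 1
            if self.level == 0:
                # Closing tag of the interesting div: stop buffering
                self.divFound = False
                print(self.buff)
            else:
                self.buff += "</" + tag + ">"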

The output is as desired:

<ul>
     <li>test1</li>
     <li>test2</li>
</ul>
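The same buffering idea can be pushed a little further, so that the buffered block is actually handed to a second parser. The following is only a sketch for Python 3 under stated assumptions: SubtreeParser, ItemParser and VOID_TAGS are hypothetical names introduced for illustration, and the "specific" parser here simply collects the text of each <li>. Instead of rebuilding start tags by hand, it uses HTMLParser.get_starttag_text(), which returns the text of the most recently opened start tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ItemParser(HTMLParser):
    """Hypothetical 'specific' parser: collects the text of each <li>."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


class SubtreeParser(HTMLParser):
    """Buffers the inner HTML of the first <div class='good'>."""

    def __init__(self):
        # Keep entity and character references as separate callbacks
        # so they can be written back into the buffer.
        super().__init__(convert_charrefs=False)
        self.depth = 0      # 0 means "outside the interesting div"
        self.done = False
        self.buff = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.buff.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.done and tag == "div" and ("class", "good") in attrs:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> do not change the depth.
        if self.depth:
            self.buff.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True    # closing </div> itself is not buffered
            else:
                self.buff.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.buff.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.buff.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.buff.append("&#%s;" % name)


html = """<html><body>
<div class='good'><ul><li>test1</li><li>test2</li></ul></div>
</body></html>"""

generic = SubtreeParser()
generic.feed(html)
generic.close()
inner = "".join(generic.buff)

specific = ItemParser()
specific.feed(inner)
specific.close()

print(inner)
print(specific.items)
```

Like the solution above, this still assumes reasonably well-formed HTML: an unclosed non-void tag inside the div would throw the depth counter off.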

As Birei said in the comments, it would have been easier to extract the subtree with BeautifulSoup:

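from bs4 import BeautifulSoup    # assuming BeautifulSoup 4 is installed

soup = BeautifulSoup(html, "html.parser")    # html holds the document as a string
div = soup("div", {"class": "good"})
children = div[0].findChildren()
print(children[0])   #### desired output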
Marcassin answered Feb 19 '26 13:02
