
Python HTMLParser dividing data at &

I'm using a simple HTMLParser to parse a webpage whose markup is always well-formed (it's automatically generated). It works well until it hits a piece of data with an '&' sign in it: the parser seems to treat that as two separate pieces of data and processes them separately (that is, it calls handle_data twice). At first I thought that unescaping the '&' would solve the issue, but I don't think it does. Does anyone have a suggestion for how I can get my parser to treat, for instance, "Paradise Bakery and Cafe" (that is, "Paradise Bakery & Café") as a single data item rather than as two?

Thanks a lot, bsg

P.S. Please don't tell me that I really should be using BeautifulSoup. I know. But in this case, I knew the markup was guaranteed to be well-formed every time, and I found HTMLParser easier to work with than BeautifulSoup. Thanks.

I'm adding my code - thanks!

#this class, extending HTMLParser, is written to process HTML within a <ul>. 
#There are 6 <a> elements nested within each <li>, and I need the data from the second 
#one. Whenever it encounters an <li> tag, it sets the 'is_li' flag to true and resets 
#the count of a's seen to 0; whenever it encounters an <a> tag, it increments the count
#by 1.   When handle_data is called, it checks to make sure that the data is within
#1)an li element and 2) an a element, and that the a element is the second one in that
#li (num_as == 2). If so, it adds the data to the list. 

from HTMLParser import HTMLParser  # Python 2 module name

class MyHTMLParser(HTMLParser):
    pages = []
    is_li = 'false'
    num_as = 0

    def __init__(self):
        HTMLParser.__init__(self)
        self.pages = []
        self.is_li = 'false'
        self.num_as = 0
        self.close_a = 'false'
        self.close_li = 'false'
        print "initialized"

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.is_li = 'true'
            self.close_a = 'false'
            self.close_li = 'false'

        if tag == 'a' and self.is_li == 'true':
            if self.num_as < 7:
                self.num_as += 1
                self.close_a = 'false'
            else:
                self.num_as = 0
                self.is_li = 'false'

    def handle_endtag(self, tag):
        if tag == 'a':
            self.close_a = 'true'

        if tag == 'li':
            self.close_li = 'true'
            self.num_as = 0

    def handle_data(self, data):
        if self.is_li == 'true':
            if self.num_as == 2 and self.close_li == 'false' and self.close_a == 'false':
                print "found data", data
                self.pages.append(data)

    def get_pages(self):
        return self.pages
bsg, asked Mar 14 '12 21:03


2 Answers

This is because & is the beginning of an HTML entity. A displayed & should be represented as &amp; in the HTML (though browsers will display an & followed by a space as an ampersand, I believe that technically this is invalid).
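To make the splitting visible, here is a minimal sketch. It assumes Python 3 (module html.parser rather than the Python 2 HTMLParser module used in the question) and passes convert_charrefs=False so that entities are reported separately, as they are in the older module. EventLogger is a hypothetical helper name, not part of the library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records text-related callbacks in order."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entities through
        # handle_entityref instead of folding them into the text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


logger = EventLogger()
logger.feed("<a>Paradise Bakery &amp; Cafe</a>")
logger.close()
for event in logger.events:
    print(event)
```

The text inside the single <a> element arrives as two handle_data calls with one handle_entityref call between them.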

You'll just need to write your handle_data() to accommodate the multiple calls. For example, use a member variable that is set to [] when you see your start tag, is appended to by each call to handle_data(), and is joined into a string when you see your end tag.
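Stripped of the <li> bookkeeping, the buffering idea might look like the sketch below. It assumes Python 3 with convert_charrefs=False (to mirror the Python 2 behaviour) and uses html.unescape to decode entities; LinkTextCollector and its attribute names are made up for illustration.

```python
import html
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Hypothetical example: gathers the full text of every <a> element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.texts = []
        self._chunks = None  # None means "not inside an <a>"

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._chunks = []  # start a fresh buffer

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)  # may be called several times

    def handle_entityref(self, name):
        # Named entity such as &amp; -> decode and treat as ordinary text.
        self.handle_data(html.unescape("&%s;" % name))

    def handle_charref(self, name):
        # Numeric reference such as &#233; -> decode the same way.
        self.handle_data(html.unescape("&#%s;" % name))

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            self.texts.append("".join(self._chunks))  # one item per <a>
            self._chunks = None


collector = LinkTextCollector()
collector.feed("<a>Paradise Bakery &amp; Caf&#233;</a><a>Plain</a>")
collector.close()
print(collector.texts)
```

Each <a> contributes exactly one string to texts, however many callbacks its content triggered.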

I've taken a whack at it below. The key lines I added have a # ***** comment. I also took the liberty of using proper Booleans for your flags rather than strings, as it allows the code to be much cleaner (hopefully I didn't mess that up). I also changed your __init__() to a reset() method (so that your parser object can be reused) and removed the superfluous class variables. Finally, I added handle_entityref() and handle_charref() methods to handle escaped character entities.

class MyHTMLParser(HTMLParser):

    def reset(self):
        HTMLParser.reset(self)
        self.pages    = []
        self.text     = []                     # *****
        self.is_li    = False
        self.num_as   = 0
        self.close_a  = False
        self.close_li = False

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.is_li    = True
            self.close_a  = False
            self.close_li = False

        if tag == 'a' and self.is_li:
            if self.num_as < 7:
                self.num_as += 1
                self.close_a = False
            else:
                self.num_as = 0
                self.is_li = False

    def handle_endtag(self, tag):
        if tag == 'a':
            self.close_a  = True
        if tag == 'li':
            self.close_li = True
            self.num_as   = 0
            self.pages.append("".join(self.text))      # *****
            self.text = []                             # *****

    def handle_data(self, data):
        if self.is_li:
            if self.num_as == 2 and not self.close_li and not self.close_a:
                print "found data",  data
                self.text.append(data)              # *****

    def handle_charref(self, ref):
        self.handle_entityref("#" + ref)

    def handle_entityref(self, ref):
        self.handle_data(self.unescape("&%s;" % ref))

    def get_pages(self):
        return self.pages

The basic idea is that rather than appending to self.pages on each call to handle_data() you instead append to self.text. Then you find some other event that will happen once for each text element (I chose when you see a </li> tag but it might be when you see </a>, I can't really tell without seeing some of your data too), join up those bits of text, and append that to pages.

Hopefully this will give you an idea of the approach I'm talking about even if the exact code I posted doesn't work for you.
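For comparison, here is a sketch of the variant that joins the text at </a> instead of </li>. It is not the code above: it assumes Python 3 with convert_charrefs=False, and the class name, the flags, and the SAMPLE markup are invented for illustration, since the real page data isn't shown.

```python
import html
from html.parser import HTMLParser


class SecondLinkParser(HTMLParser):
    """Collects the text of the second <a> inside each <li>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.pages = []
        self.text = []
        self.in_li = False
        self.in_a = False
        self.num_as = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.num_as = 0
        elif tag == "a" and self.in_li:
            self.num_as += 1
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "a" and self.in_a:
            if self.num_as == 2:
                # Flush the buffered chunks as a single item.
                self.pages.append("".join(self.text))
            self.text = []
            self.in_a = False
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_a and self.num_as == 2:
            self.text.append(data)

    def handle_entityref(self, name):
        self.handle_data(html.unescape("&%s;" % name))

    def handle_charref(self, name):
        self.handle_data(html.unescape("&#%s;" % name))


# Made-up markup standing in for the generated page.
SAMPLE = (
    "<ul>"
    "<li><a>1</a><a>Paradise Bakery &amp; Cafe</a><a>x</a></li>"
    "<li><a>2</a><a>Fish &amp; Chips</a><a>y</a></li>"
    "</ul>"
)
parser = SecondLinkParser()
parser.feed(SAMPLE)
parser.close()
print(parser.pages)
```

Flushing at </a> keeps the buffer tied to one element, so text from later links in the same <li> can never leak into the collected item.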

kindall, answered Sep 22 '22 10:09


Unescaping & would result in strange behaviour at every &amp;. I made a class that does not split data into chunks at & entities. You can find it HERE.
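As a hedged aside, the linked class is not reproduced here, but on Python 3 the standard html.parser.HTMLParser accepts a convert_charrefs argument that has a similar effect: when it is True, character references are decoded and delivered as part of the surrounding text. The sketch below assumes Python 3.4 or later; DataLogger is a made-up name.

```python
from html.parser import HTMLParser


class DataLogger(HTMLParser):
    """Hypothetical helper: records every handle_data call."""

    def __init__(self):
        # With convert_charrefs=True the parser unescapes entities itself
        # and does not call handle_entityref / handle_charref for them.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


plain = DataLogger()
plain.feed("<a>Paradise Bakery &amp; Cafe</a>")
plain.close()
print(plain.chunks)
```

For this input the text arrives in a single handle_data call with the ampersand already decoded. Buffering is still a sensible precaution when the input is fed in several pieces.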

SzieberthAdam, answered Sep 20 '22 10:09