I'm using a simple HTMLParser to parse a webpage whose markup is always well-formed (it's automatically generated). It works well until it hits a piece of data with an '&' sign in it. The parser seems to think that this makes it two separate pieces of data and processes them separately. (That is, it calls "handle_data" twice.) At first I thought that unescaping the '&' would solve the issue, but I don't think it does. Does anyone have a suggestion for how I can get my parser to treat, for instance, "Paradise Bakery and Cafe" (that is, "Paradise Bakery & Café") as a single data item rather than as two?
Thanks a lot, bsg
P.S. Please don't tell me that I really should be using BeautifulSoup. I know. But in this case, I knew the markup was guaranteed to be well-formed every time, and I found HTMLParser easier to work with than BeautifulSoup. Thanks.
I'm adding my code - thanks!
from HTMLParser import HTMLParser  # Python 2 module name

# This class, extending HTMLParser, is written to process HTML within a <ul>.
# There are 6 <a> elements nested within each <li>, and I need the data from the second
# one. Whenever it encounters an <li> tag, it sets the 'is_li' flag to true and resets
# the count of a's seen to 0; whenever it encounters an <a> tag, it increments the count
# by 1. When handle_data is called, it checks to make sure that the data is within
# 1) an li element and 2) an a element, and that the a element is the second one in that
# li (num_as == 2). If so, it adds the data to the list.
class MyHTMLParser(HTMLParser):
    pages = []
    is_li = 'false'
    #is_li
    num_as = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.pages = []
        self.is_li = 'false'
        self.num_as = 0
        self.close_a = 'false'
        self.close_li = 'false'
        print "initialized"
    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.is_li = 'true'
            self.close_a = 'false'
            self.close_li = 'false'
        if tag == 'a' and self.is_li == 'true':
            if self.num_as < 7:
                self.num_as += 1
                self.close_a = 'false'
            else:
                self.num_as = 0
                self.is_li = 'false'
    def handle_endtag(self, tag):
        if tag == 'a':
            self.close_a = 'true'
        if tag == 'li':
            self.close_li = 'true'
            self.num_as = 0
    def handle_data(self, data):
        if self.is_li == 'true':
            if self.num_as == 2 and self.close_li == 'false' and self.close_a == 'false':
                print "found data", data
                self.pages.append(data)
    def get_pages(self):
        return self.pages
This is because & is the beginning of an HTML entity, so the parser hands you the text before and after it in separate handle_data() calls. A displayed & should be represented as &amp; in the HTML (though browsers will display an & followed by a space as an ampersand, I believe that technically this is invalid).
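To see the splitting for yourself, here is a minimal sketch. It assumes Python 3's html.parser rather than the Python 2 HTMLParser module used in this thread, and it passes convert_charrefs=False so that entities are reported separately, as the older module does. EventLogger is a made-up name for illustration only.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records every text-related callback."""

    def __init__(self):
        # Assumption: convert_charrefs=False, so entities arrive via
        # handle_entityref()/handle_charref() instead of being merged into data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


logger = EventLogger()
logger.feed("<a>Paradise Bakery &amp; Caf&eacute;</a>")
logger.close()

for event in logger.events:
    print(event)
# ('data', 'Paradise Bakery ')
# ('entityref', 'amp')
# ('data', ' Caf')
# ('entityref', 'eacute')
```

The single visible string arrives as two data chunks with two entity callbacks in between, which is why appending directly to a result list inside handle_data() produces fragments.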
You'll just need to write your handle_data() to accommodate the multiple calls, for example using a member variable that gets set to [] when you see your start tag, is appended to by each call to handle_data(), and is then joined into a string when you see your end tag.
I've taken a whack at it below. The key lines I added have a # ***** comment. I also took the liberty of using proper Booleans for your flags rather than strings, as it allows the code to be much cleaner (hopefully I didn't mess that up). I also changed your __init__() to a reset() method (so that your parser object can be reused) and removed the superfluous class variables. Finally, I added handle_entityref() and handle_charref() methods to handle escaped character entities.
from HTMLParser import HTMLParser  # Python 2 module name

class MyHTMLParser(HTMLParser):
    def reset(self):
        HTMLParser.reset(self)
        self.pages = []
        self.text = []                                  # *****
        self.is_li = False
        self.num_as = 0
        self.close_a = False
        self.close_li = False
    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.is_li = True
            self.close_a = False
            self.close_li = False
        if tag == 'a' and self.is_li:
            if self.num_as < 7:
                self.num_as += 1
                self.close_a = False
            else:
                self.num_as = 0
                self.is_li = False
    def handle_endtag(self, tag):
        if tag == 'a':
            self.close_a = True
        if tag == 'li':
            self.close_li = True
            self.num_as = 0
            self.pages.append("".join(self.text))       # *****
            self.text = []                              # *****
    def handle_data(self, data):
        if self.is_li:
            if self.num_as == 2 and not self.close_li and not self.close_a:
                print "found data", data
                self.text.append(data)                  # *****
    # Entities arrive through their own callbacks; unescape them and
    # route them through handle_data() so they land in the same buffer.
    def handle_charref(self, ref):
        self.handle_entityref("#" + ref)
    def handle_entityref(self, ref):
        self.handle_data(self.unescape("&%s;" % ref))
    def get_pages(self):
        return self.pages
The basic idea is that rather than appending to self.pages on each call to handle_data(), you instead append to self.text. Then you find some other event that will happen once for each text element (I chose when you see a </li> tag, but it might be when you see </a>; I can't really tell without seeing some of your data too), join up those bits of text, and append that to pages.
Hopefully this will give you an idea of the approach I'm talking about even if the exact code I posted doesn't work for you.
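For readers on Python 3, the same buffering idea might look roughly like the sketch below. This is not a port of the code above: the class name SecondLinkParser, the in_li/in_a flags and the sample markup are all assumptions made for illustration, html.unescape() stands in for the Python 2 self.unescape() method, and convert_charrefs=False is passed so that the entity callbacks are actually invoked.

```python
from html import unescape
from html.parser import HTMLParser


class SecondLinkParser(HTMLParser):
    """Hypothetical sketch: collect the text of the second <a> in each <li>."""

    def __init__(self):
        # Assumption: entities are reported separately, as in the Python 2 module.
        super().__init__(convert_charrefs=False)
        self.pages = []   # one joined string per <li>
        self.text = []    # chunks seen so far for the current <li>
        self.in_li = False
        self.in_a = False
        self.num_as = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.num_as = 0
            self.text = []
        elif tag == "a" and self.in_li:
            self.num_as += 1
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False
        elif tag == "li" and self.in_li:
            # Join the chunks once per <li>, not once per handle_data() call.
            self.pages.append("".join(self.text))
            self.text = []
            self.in_li = False

    def handle_data(self, data):
        if self.in_li and self.in_a and self.num_as == 2:
            self.text.append(data)

    def handle_entityref(self, name):
        # Turn e.g. "amp" back into "&" and buffer it like ordinary text.
        self.handle_data(unescape("&%s;" % name))

    def handle_charref(self, name):
        # Numeric references such as "233" or "xE9".
        self.handle_data(unescape("&#%s;" % name))


parser = SecondLinkParser()
parser.feed(
    '<ul><li><a href="#">first</a>'
    '<a href="#">Paradise Bakery &amp; Caf&eacute;</a>'
    '<a href="#">third</a></li></ul>'
)
parser.close()
print(parser.pages)
```

Like the version above, this appends an empty string for an <li> that has no second link; filter those out if your data can contain such items.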
Unescaping &amp; before parsing would result in strange behaviour at every &. I made a class which does not split data into chunks at entities. You can find it HERE.
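The linked class is not reproduced here. As a rough illustration of the same goal, and assuming Python 3.5 or later, the standard html.parser already behaves this way by default: with convert_charrefs left at True, character references are converted and delivered as part of the surrounding text. SingleChunkLogger is a made-up name, and this sketch feeds the whole document in one call; text fed in several pieces may still arrive in several chunks, so buffering remains the safer habit.

```python
from html.parser import HTMLParser


class SingleChunkLogger(HTMLParser):
    """Hypothetical helper that records each handle_data() call."""

    def __init__(self):
        # Default convert_charrefs=True: entities are unescaped by the parser
        # and handle_entityref()/handle_charref() are never called.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


single = SingleChunkLogger()
single.feed("<a>Paradise Bakery &amp; Caf&eacute;</a>")
single.close()
print(single.chunks)
```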