BeautifulSoup 3.1 parser breaks far too easily

Question

I was having trouble parsing some dodgy HTML with BeautifulSoup. Turns out that the HTMLParser used in newer versions is less tolerant than the SGMLParser used previously.

Does BeautifulSoup have some kind of debug mode? I'm trying to figure out how to stop it borking on some nasty HTML I'm loading from a crabby website:

<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>

BeautifulSoup gives up after the <HTTP-EQUIV...> tag

In [1]: print BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

The problem is clearly the HTTP-EQUIV tag, which is really a very malformed <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> tag. Evidently, I need to specify this as self-closing, but no matter what I specify I can't fix it:

In [2]: print BeautifulSoup(c,selfClosingTags=['http-equiv',
                            'http-equiv="pragma"']).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

Is there a verbose debug mode in which BeautifulSoup will tell me what it is doing, so I can figure out what it is treating as the tag name in this case?

jfs · Accepted Answer

Having problems with Beautiful Soup 3.1.0? recommends to use html5lib's parser as one of workarounds.

#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>"""

soup = parser.parse(c)
print soup.prettify()

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  <http-equiv="pragma" content="NO-CACHE">
   ...
        ...
  </http-equiv="pragma">
 </body>
</html>

The output shows that html5lib hasn't fixed the problem in this case though.

aehlke · Answer

Try lxml (and its html module). Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Blicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

Ian Blicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

BeautifulSoup 3.1 parser breaks far too easily

Tags:

python

html

parsing

beautifulsoup

Mat

2 Answers

jfs

aehlke

Recent Activity

Donate For Us

BeautifulSoup 3.1 parser breaks far too easily

Tags:

python

html

parsing

beautifulsoup

Mat

2 Answers

jfs

aehlke

Related questions

Recent Activity

Donate For Us