BeautifulSoup can't parse a webpage?

Question

I am using Beautiful Soup to parse a webpage. I've heard it is popular and well regarded, but it doesn't seem to work properly.

Here's what I did:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()

I think this is fairly straightforward: I open the webpage and pass it to BeautifulSoup. But here's what I got:

Warning (from warnings module):

File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149

"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))

...

HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94

I assumed the CNN website would be well-formed, so I'm not sure what's going on. Does anyone have an idea about this?

jfs · Accepted Answer

From the docs:

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
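The version thresholds in that quote can be checked directly. This is only a rough sketch: the helper name `builtin_parser_is_reliable` is made up for illustration, and the cutoffs (2.7.3 and 3.2.2) are taken from the quoted documentation rather than from any Beautiful Soup API.

```python
import sys


def builtin_parser_is_reliable(version=sys.version_info):
    """Return True if this Python is at or past the versions named in the docs.

    Hypothetical helper: the thresholds come from the Beautiful Soup
    documentation quoted above, not from the library itself.
    """
    v = tuple(version[:3])
    if v[0] == 2:
        return v >= (2, 7, 3)
    return v >= (3, 2, 2)


if not builtin_parser_is_reliable():
    print("Consider installing lxml or html5lib for this Python version.")
```

If the check fails, installing lxml or html5lib is the route the documentation recommends.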

Your code works as is (on Python 2.7 and Python 3.3) if you install a more robust parser, such as lxml or html5lib, on Python 2.7:

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
# No parser is named here, so Beautiful Soup uses the best one installed.
soup = BeautifulSoup(urlopen(url))
print(soup.prettify())
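If you would rather name the parser explicitly than rely on whichever one happens to be installed, something like the following might work. It is a minimal sketch for Python 3.4+ only: `pick_parser`, `parse`, and `SAMPLE` are illustrative names, and `SAMPLE` is a made-up snippet imitating the split `</"+"script>` pattern from the error message, not the actual CNN markup.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in "html.parser"."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


# Illustrative markup only: a closing script tag split across a
# JavaScript string concatenation, as in the reported error.
SAMPLE = (
    '<html><body><script>'
    'document.write("<script src=x></"+"script>");'
    '</script><p>ok</p></body></html>'
)


def parse(markup):
    """Parse markup with an explicitly chosen parser (requires beautifulsoup4)."""
    from bs4 import BeautifulSoup  # imported here so pick_parser works without bs4
    return BeautifulSoup(markup, pick_parser())
```

Calling `parse(SAMPLE)` passes the parser name as the second argument to `BeautifulSoup`, which makes the choice visible in the code instead of depending on the environment.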

The Python bug report "HTMLParser.py - more robust SCRIPT tag parsing" might be related.

Tags:

python

parsing

beautifulsoup

JLTChiu

1 Answers

jfs
