
BeautifulSoup can't parse a webpage?

I am using Beautiful Soup to parse a webpage. I've heard it is well known and reliable, but it doesn't seem to work properly.

Here's what I did:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()

I think this is fairly straightforward: I open the webpage and pass it to Beautiful Soup. But here's what I got:

Warning (from warnings module):

File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149

"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))

...

HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94

I assumed the CNN website would be well-formed, so I'm not sure what's going on. Does anyone have an idea what causes this?

JLTChiu asked Oct 14 '12 21:10


1 Answer

From the docs:

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

Your code works as is (on Python 2.7 and Python 3.3) if you install a more robust parser (such as lxml or html5lib) on Python 2.7:

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen # py3k

from bs4 import BeautifulSoup # $ pip install beautifulsoup4

url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
soup = BeautifulSoup(urlopen(url))
print(soup.prettify())
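
The snippet above lets Beautiful Soup choose whichever parser happens to be installed. If you would rather name the parser explicitly, something like the following sketch might help. It assumes Python 3 (for `importlib.util.find_spec`); `pick_parser`, `make_soup`, and `PREFERRED_PARSERS` are hypothetical helper names, not part of bs4. The strings "lxml", "html5lib", and "html.parser" are the parser names Beautiful Soup accepts as its second argument.

```python
from importlib.util import find_spec

# Parser names accepted by BeautifulSoup, in order of preference.
# "html.parser" is the standard-library fallback and is always present.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def pick_parser(preferred=PREFERRED_PARSERS):
    """Return the first parser name whose backing module can be imported."""
    for name in preferred:
        # For all three names the module to look for has the same name.
        if find_spec(name) is not None:
            return name
    raise RuntimeError("none of the requested HTML parsers is installed")


def make_soup(markup, parser=None):
    """Parse markup with an explicitly named parser (hypothetical helper)."""
    # Imported lazily so pick_parser() stays usable without bs4 installed.
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
    return BeautifulSoup(markup, parser or pick_parser())
```

With this in place, `make_soup(urlopen(url).read())` would use lxml when it is installed and fall back to html5lib or the built-in parser otherwise. Passing the name yourself, e.g. `make_soup(html, "html5lib")`, skips the detection.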

The Python bug "HTMLParser.py - more robust SCRIPT tag parsing" might be related.

jfs answered Oct 02 '22 06:10