I'm using Beautiful Soup to parse a webpage. I've heard it's popular and well regarded, but it doesn't seem to work properly.
Here's what I did:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()
I think this is fairly straightforward: I open the webpage and pass it to BeautifulSoup. But here's what I got:
Warning (from warnings module):
File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149
"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
...
HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94
I assumed the CNN website would be well-formed, so I'm not sure what's going on. Does anyone have an idea about this?
From the docs:
If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
Your code works as is (on Python 2.7 and Python 3.3) if you install a more robust parser on Python 2.7 (such as lxml or html5lib):
try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen  # py3k
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
soup = BeautifulSoup(urlopen(url))
print(soup.prettify())
The Python bug report "HTMLParser.py - more robust SCRIPT tag parsing" might be related.
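If you'd rather not rely on Beautiful Soup choosing a parser for you, you can name one explicitly as the second argument. Below is a minimal sketch of that idea, assuming Python 3.4+; the helper names `pick_parser` and `make_soup` are hypothetical and only for illustration. It checks which of lxml or html5lib is installed and falls back to the built-in "html.parser" otherwise.

```python
import importlib.util


def pick_parser():
    """Return the name of the first parser library found, else the built-in one."""
    # "lxml", "html5lib" and "html.parser" are the parser names Beautiful Soup
    # accepts as its second argument.
    for module_name in ("lxml", "html5lib"):
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return "html.parser"  # ships with Python, so it is always available


def make_soup(markup):
    """Parse markup with an explicitly named parser (hypothetical helper)."""
    # Imported lazily so pick_parser() works even without beautifulsoup4.
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
    return BeautifulSoup(markup, pick_parser())
```

You would then call something like `make_soup(urlopen(url).read())` in place of `BeautifulSoup(urlopen(url))`. Naming the parser also makes it obvious which one is in use when the same script runs on different machines.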