
Why is BeautifulSoup unable to correctly read/parse this RSS (XML) document?

YCombinator is nice enough to provide an RSS feed and a big RSS feed containing the top items on Hacker News. I am trying to write a Python script to fetch the RSS feed and then extract certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to read the content of each item.

Here are a few sample lines of the RSS feed:

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in Python) to access this feed and print out the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, this script is giving output that looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

As you can see, the middle item, link, is somehow being omitted. That is, the resulting value of link is somehow an empty string. So why is that?

As I dig into what is in soup, I realize that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

You'll notice that something screwy is happening with just the link tag. It just gets the close tag and then the text for that tag after it. This is some very strange behavior especially in contrast to title and comments being parsed without a problem.

This seems to be a problem with BeautifulSoup because what is actually read in by requests doesn't have any problems with it. I don't think it is limited to BeautifulSoup though because I tried using xml.etree.ElementTree API as well and the same problem arose (is BeautifulSoup built on this API?).

Does anyone know why this would be happening or how I can still use BeautifulSoup without getting this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but this doesn't seem like a highly recommended library. I would like to continue using BeautifulSoup if possible.
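
For reference, the xml.dom.minidom route mentioned in the note might look roughly like this. It is a minimal sketch, not the exact script used: it runs against a trimmed copy of the sample feed instead of the live URL, is written for Python 3, and child_text and parse_items are hypothetical helper names.

```python
from xml.dom import minidom

# Trimmed copy of the sample feed from the question (one item only).
FEED = """<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
</item>
</channel>
</rss>"""


def child_text(node, tag):
    """Return the text of the first <tag> element below node (hypothetical helper)."""
    element = node.getElementsByTagName(tag)[0]
    # Join plain text and CDATA children; nested elements are ignored.
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


def parse_items(feed_text):
    """Return a (title, link, comments) tuple for every <item> in the feed."""
    dom = minidom.parseString(feed_text)
    return [
        tuple(child_text(item, tag) for tag in ("title", "link", "comments"))
        for item in dom.getElementsByTagName("item")
    ]


for title, link, comments in parse_items(FEED):
    print(title + " - " + link + " - " + comments)
```

Since minidom is a strict XML parser, <link> is treated like any other element here and its text stays inside it.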

Update: I am on a Mac with OSX 10.8 using Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml, version 3.0.2, installed with pip. As for libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. Not sure when or how that was installed.

asked Dec 19 '12 21:12 by jbranchaud




1 Answer

Wow, great question. This strikes me as a bug in BeautifulSoup. The reason you can't get the link with item.find('link').text is that when the feed is first loaded into BeautifulSoup, it does something odd to the markup (html below holds the feed text):

>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
</item>
...
</channel>
</rss></body></html>

Look carefully: it has changed each <link> tag to <link/> and dropped the matching </link> tag, so the URL ends up as loose text after an empty element. I'm not sure why it would do this, but unless the problem is fixed in the BeautifulSoup class initialization, item.find('link').text will keep coming back empty.
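
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the <html><body> wrapper in the output above suggests the feed went through an HTML parser, and in HTML <link> is a void element that cannot contain text. A minimal sketch of asking BeautifulSoup for its XML parser instead, assuming lxml is installed (the "xml" feature depends on it). FEED is a trimmed copy of the sample feed and parse_items is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample feed from the question.
FEED = """<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
</item>
</channel>
</rss>"""


def parse_items(feed_text):
    """Return a (title, link, comments) tuple for every <item> in the feed."""
    # "xml" selects lxml's XML parser, so <link> is an ordinary element.
    soup = BeautifulSoup(feed_text, "xml")
    return [
        (item.find("title").text, item.find("link").text, item.find("comments").text)
        for item in soup.find_all("item")
    ]


for title, link, comments in parse_items(FEED):
    print(title + " - " + link + " - " + comments)
```

With the live feed, the same call would take request.text in place of FEED.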

Update:

I think your best (albeit hack-y) bet for now is to use the following for link:

>>> soup.find('item').link.next_sibling.strip()
u'https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch'
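
Applied to every item, the next_sibling workaround could be sketched as follows. This assumes the feed is parsed with Python's built-in html.parser (passed explicitly so the result does not depend on which parsers are installed) and that the URL text node directly follows the emptied <link/> tag. FEED is a trimmed copy of the sample feed and item_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample feed from the question (one item only).
FEED = """<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
</item>
</channel>
</rss>"""


def item_links(feed_text):
    """Return the link URL of every <item>, read from the text after <link/>."""
    soup = BeautifulSoup(feed_text, "html.parser")
    links = []
    for item in soup.find_all("item"):
        # The HTML parser leaves <link/> empty; the URL is its next sibling.
        sibling = item.find("link").next_sibling
        # Strip the newline and indentation that trail the URL text node.
        links.append(sibling.strip() if sibling else "")
    return links


print(item_links(FEED))
```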
answered Oct 22 '22 23:10 by jdotjdot