Hi,
I am having problems parsing an RSS feed from Stack Exchange in Python. When I try to get the summary nodes, an empty list is returned.
I have been trying to solve this, but I can't get my head around it.
Can anyone help out? Thanks a lot.
In [30]: import lxml.etree, urllib2
In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'
In [32]: cooking_content = urllib2.urlopen(url_cooking)
In [33]: cooking_parsed = lxml.etree.parse(cooking_content)
In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')
In [35]: cooking_texts
Out[35]: []
Take a look at these two versions:
import lxml.html, lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
As you discovered, the first (lxml.etree) version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects the namespace, and the html version works because it ignores namespaces. Part way down http://lxml.de/lxmlhtml.html, it says "The HTML parser notably ignores namespaces and some other XMLisms."
Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element with a namespace of http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.
import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

# map a prefix to the Atom namespace so it can be used in the XPath expression
ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
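If you'd rather not manage an XPath prefix map, lxml's findall() also accepts ElementTree-style {namespace}tag qualified names. A minimal sketch of that alternative (the variable names are just illustrative, not from the answer above):

import lxml.etree

ATOM = '{http://www.w3.org/2005/Atom}'   # qualified-name prefix for the Atom namespace

url_cooking = 'http://cooking.stackexchange.com/feeds'
data = lxml.etree.parse(url_cooking)

# findall() understands {namespace}tag names, so no prefix-to-URI map is needed;
# the path is relative to the root <feed> element
summary_nodes = data.findall(ATOM + 'entry/' + ATOM + 'summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')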
The problem is namespaces.
Run this:
cooking_parsed.getroot().tag
And you'll see that the element is namespaced as
{http://www.w3.org/2005/Atom}feed
The same is true if you navigate to one of the feed entries. This means the right XPath expression in lxml is:
print cooking_parsed.xpath(
    "//a:feed/a:entry",
    namespaces={'a': 'http://www.w3.org/2005/Atom'})
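Building on that, a short sketch that reuses the same prefix map to pull the summary text out of each entry (it assumes the cooking_parsed tree from the question's session):

ns_map = {'a': 'http://www.w3.org/2005/Atom'}
entries = cooking_parsed.xpath('//a:feed/a:entry', namespaces=ns_map)
for entry in entries:
    # the same prefix map works for relative queries on each entry element
    for summary in entry.xpath('./a:summary', namespaces=ns_map):
        print(summary.text)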
Try using BeautifulStoneSoup from the BeautifulSoup package. It might do the trick.
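BeautifulStoneSoup is the BeautifulSoup 3 class; a minimal sketch, assuming BeautifulSoup 4 with its "xml" parser instead (which matches tags by local name and so sidesteps the namespace issue):

from bs4 import BeautifulSoup   # assumption: BeautifulSoup 4; BS3 used BeautifulStoneSoup
import urllib2

url_cooking = 'http://cooking.stackexchange.com/feeds'
xml_text = urllib2.urlopen(url_cooking).read()

# the "xml" parser (backed by lxml) lets you find tags by their local name
soup = BeautifulSoup(xml_text, 'xml')
summaries = soup.find_all('summary')
print('Found ' + str(len(summaries)) + ' summary nodes')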