 

lxml - difficulty parsing Stack Exchange RSS feed

Hi,

I am having problems parsing an RSS feed from Stack Exchange in Python. When I try to get the summary nodes, an empty list is returned.

I have been trying to solve this, but I can't get my head around it.

Can anyone help out? Thanks.

In [30]: import lxml.etree, urllib2

In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds' 

In [32]: cooking_content = urllib2.urlopen(url_cooking)

In [33]: cooking_parsed = lxml.etree.parse(cooking_content)

In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')

In [35]: cooking_texts
Out[35]: []

MrCastro asked Feb 23 '12

3 Answers

Take a look at these two versions:

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

# lxml.etree version: the parser is namespace-aware, so this un-prefixed XPath matches nothing
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

# lxml.html version: the HTML parser ignores namespaces, so the same XPath finds the summaries
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

As you discovered, the etree version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects the namespace-qualified tag names, and the html version works because it ignores namespaces. Part way down http://lxml.de/lxmlhtml.html, it says "The HTML parser notably ignores namespaces and some other XMLisms."

Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the http://www.w3.org/2005/Atom namespace. Here is a corrected version of the etree code.

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

# Bind a prefix to the Atom namespace so it can be used in the XPath expression
ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
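If you would rather not bind a prefix at all, you can also match on the local tag name and ignore the namespace entirely. This is just a sketch of an alternative (not part of the original answer) using the standard XPath local-name() function:

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'
data = lxml.etree.parse(url_cooking)

# Match elements by local name only, regardless of their namespace
summary_nodes = data.xpath("//*[local-name()='summary']")
print('Found ' + str(len(summary_nodes)) + ' summary nodes')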
gfortune answered Oct 15 '22

The problem is namespaces.

Run this:

 cooking_parsed.getroot().tag

And you'll see that the element is namespaced as

{http://www.w3.org/2005/Atom}feed

The same is true if you navigate to one of the feed entries.

This means the right XPath in lxml is:

print cooking_parsed.xpath(
  "//a:feed/a:entry",
  namespaces={ 'a':'http://www.w3.org/2005/Atom' })
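To get at the actual summary text for each entry, you can combine that namespace map with findtext. This is only a sketch along the same lines (the ATOM variable name is just for illustration; findtext accepts the Clark-notation '{namespace}tag' form for namespaced children):

ATOM = 'http://www.w3.org/2005/Atom'
entries = cooking_parsed.xpath(
  "//a:feed/a:entry",
  namespaces={ 'a': ATOM })

for entry in entries:
  # print the text of each entry's <summary> element
  print entry.findtext('{%s}summary' % ATOM)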
Michael Anderson answered Oct 15 '22

Try using BeautifulStoneSoup from the BeautifulSoup package. It might do the trick.
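For reference, here is a rough sketch of that idea. It assumes the old BeautifulSoup 3 package, which is where BeautifulStoneSoup lives (with BeautifulSoup 4 you would use BeautifulSoup(markup, 'xml') instead):

import urllib2
from BeautifulSoup import BeautifulStoneSoup

url_cooking = 'http://cooking.stackexchange.com/feeds'
soup = BeautifulStoneSoup(urllib2.urlopen(url_cooking).read())

# BeautifulStoneSoup parses XML leniently and does not care about namespaces
summaries = soup.findAll('summary')
print('Found ' + str(len(summaries)) + ' summary nodes')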

user850498 answered Oct 15 '22