 

scrapy response.xpath returns empty array on xml document with default namespace, while response.re works

I am new to Scrapy and was playing with the Scrapy shell, trying to crawl this site: www.spiegel.de/sitemap.xml

I did it with

scrapy shell "http://www.spiegel.de/sitemap.xml"

and it works fine. When I use

response.body 

I can see the whole page, including the XML tags.

However, this, for instance:

response.xpath('//loc') 

simply won't work.

The result I get is an empty list,

while

response.selector.re('somevalidregexpexpression') 

works.

Any idea what the reason could be? Could it be related to encoding? The site is not UTF-8.

I am using Python 2.7 on Windows 7. I tried xpath() on another site (dmoz) and it worked fine.

elMeroMero asked Mar 25 '16 23:03



2 Answers

The problem is the default namespace declared on the root element of the XML:

xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"

So in that XML, the root element and all of its unprefixed descendants implicitly inherit the same namespace.

In XPath, on the other hand, there is no implied default namespace. An unprefixed name such as loc only matches elements that are in no namespace, so to reference an element in a namespace you need to use a prefix that is bound to that namespace URI.

You can use selector.register_namespace() to bind a prefix to the default namespace URI, and then use that prefix in your XPath:
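To see the effect outside Scrapy, here is a minimal sketch that is not part of the original answer. It uses Python's standard xml.etree.ElementTree instead of Scrapy's selectors, and the sitemap fragment and example.com URLs are made-up placeholders. It only illustrates that elements under a default namespace are not found by an unprefixed path, but are found once a prefix is mapped to the namespace URI.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sitemap fragment; the URLs are placeholders.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# The parser stores the inherited namespace as part of every tag name.
print(root.tag)  # {http://www.sitemaps.org/schemas/sitemap/0.9}urlset

# An unprefixed path looks for <loc> in no namespace and finds nothing.
unprefixed = root.findall(".//loc")
print(len(unprefixed))  # 0

# Mapping a prefix (here "d", an arbitrary choice) to the URI makes it match.
prefixed = root.findall(".//d:loc", {"d": NS_URI})
print([e.text for e in prefixed])
```

The prefix name itself carries no meaning. Only the URI it is bound to has to match the one declared in the document.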

response.selector.register_namespace('d', 'http://www.sitemaps.org/schemas/sitemap/0.9')
response.xpath('//d:loc')
har07 answered Sep 19 '22 05:09


You can also match on the element's local name, ignoring its namespace, as in:

response.xpath("//*[local-name()='loc']")

This is especially useful if you are parsing responses from multiple heterogeneous sources and you don't want to register each and every namespace.

Rabih Kodeih answered Sep 20 '22 05:09