I am new to Scrapy and was playing with the Scrapy shell, trying to crawl this site: www.spiegel.de/sitemap.xml
I started it with
scrapy shell "http://www.spiegel.de/sitemap.xml"
and it works fine. When I use
response.body
I can see the whole page, including the XML tags.
However, this, for instance:
response.xpath('//loc')
simply won't work.
The result I get is an empty list,
while
response.selector.re('somevalidregexpexpression')
does work.
Any idea what the reason could be? Could it be related to encoding? The site is not UTF-8.
I am using Python 2.7 on Windows 7. I tried xpath() on another site (dmoz) and it worked fine.
The XML Path Language (XPath) is used to identify or address parts of an XML document. An XPath expression can search through an XML document and extract information from any part of it, such as an element or attribute (both are referred to as nodes).
When you pass text to an XPath string function, use . (dot) instead of .//text(), because .//text() produces a collection of text nodes, called a node-set.
XPath is a major element in the XSLT standard. It uses a path-like syntax to identify and navigate elements and attributes in an XML document, and it contains over 200 built-in functions.
The problem is caused by the default namespace declared on the root element of the XML:
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
So in that XML, the root element and its unprefixed descendants implicitly inherit that namespace.
In XPath, on the other hand, there is no implied default namespace: to reference an element in a namespace, you need a prefix bound to that namespace URI. An unprefixed name such as loc only matches elements in no namespace, which is why //loc returns nothing here.
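To see the effect outside of Scrapy, here is a minimal sketch using Python's standard-library xml.etree.ElementTree rather than Scrapy's selectors. The sitemap content and the example.com URLs are made up for illustration, and the prefix d is an arbitrary choice; only the namespace URI comes from the real sitemap.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sitemap; only the xmlns URI matches the real one.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/a</loc></url>
  <url><loc>http://www.example.com/b</loc></url>
</urlset>"""

# Arbitrary prefix "d" bound to the sitemap namespace URI.
NS = {"d": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)

# Every element name carries the namespace URI, even without a prefix in the XML.
print(root.tag)  # {http://www.sitemaps.org/schemas/sitemap/0.9}urlset

# An unprefixed name only matches elements that are in no namespace.
unprefixed = root.findall(".//loc")
print(len(unprefixed))  # 0

# With a prefix bound to the namespace URI, the elements are found.
prefixed = root.findall(".//d:loc", NS)
print([el.text for el in prefixed])
```

The same reasoning applies to the XPath engine behind response.xpath(): the name in the path has to be tied to the namespace URI, and a prefix is the way to do that.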
You can use selector.register_namespace()
to bind a prefix to the default namespace URI, and then use that prefix in your XPath:
response.selector.register_namespace('d', 'http://www.sitemaps.org/schemas/sitemap/0.9')
response.xpath('//d:loc')
You can also match on the element's local name, ignoring its namespace, as in:
response.xpath("//*[local-name()='loc']")
This is especially useful if you are parsing responses from multiple heterogeneous sources and you don't want to register each and every namespace.