I'm trying to get both keys and values of attributes of some tag in a XML file (using scrapy and xpath).
The tag is something like:
<element attr1="value1" attr2="value2" ...>
I don't know the keys "attr1", "attr2" and so on in advance, and they can differ between elements. I couldn't figure out how to get both keys and values with XPath. Is there any other good practice for doing this?
Short version
>>> attributes = []
>>> for element in selector.xpath('//element'):
...     # loop over all attribute nodes of the element
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         # use XPath's name() string function on each attribute,
...         # via its position
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         # Scrapy's extract() on an attribute returns its value
...         attributes.append((attribute_name, attribute.extract()))
...
>>> attributes # list of (attribute name, attribute value) tuples
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>>
Long version
XPath has a name(node-set?)
function to get node names (an attribute is a node, specifically an attribute node):
The name function returns a string containing a QName representing the expanded-name of the node in the argument node-set that is first in document order. (...) If the argument is omitted, it defaults to a node-set with the context node as its only member.
(source: http://www.w3.org/TR/xpath/#function-name)
>>> import scrapy
>>> selector = scrapy.Selector(text='''
... <html>
... <element attr1="value1" attr2="value2">some text</element>
... </html>''')
>>> selector.xpath('//element').xpath('name()').extract()
[u'element']
(Here, I chained name()
on the result of the //element
selection, to apply the function to all selected element nodes. This is a handy feature of Scrapy selectors.)
One would like to do the same with attribute nodes, right? But it does not work:
>>> selector.xpath('//element/@*').extract()
[u'value1', u'value2']
>>> selector.xpath('//element/@*').xpath('name()').extract()
[]
>>>
Note: I don't know whether this is a limitation of lxml/libxml2
, which Scrapy uses under the hood, or whether the XPath specs disallow it. (I don't see why they would.)
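As an aside, if you only need name/value pairs and are not tied to XPath, a parsed element tree exposes each element's attributes as a plain dict. A minimal sketch with the standard library's ElementTree (the markup mirrors the example above; this is an alternative approach, not what Scrapy selectors do):

```python
# Sketch: outside of XPath, a parsed tree exposes each element's
# attributes as a dict, so no name() tricks are needed.
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<html><element attr1="value1" attr2="value2">some text</element></html>')
for element in root.iter('element'):
    # element.attrib is a dict of {attribute name: attribute value}
    print(sorted(element.attrib.items()))  # [('attr1', 'value1'), ('attr2', 'value2')]
```

This only helps if you can work on the raw document directly; within a Scrapy spider you usually stay on the selector API, hence the XPath workaround below.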
What you can do, though, is use the name(node-set)
form, i.e. with a non-empty node-set as argument. If you read the part of the XPath 1.0 specs quoted above carefully, name(node-set)
, like other string functions, only takes into account the first node in the node-set (in document order):
>>> selector.xpath('//element').xpath('@*').extract()
[u'value1', u'value2']
>>> selector.xpath('//element').xpath('name(@*)').extract()
[u'attr1']
>>>
Attribute nodes also have positions, so you can loop over all attributes by position. Here there are 2 (the result of count(@*)
on the context node):
>>> for element in selector.xpath('//element'):
...     print element.xpath('count(@*)').extract_first()
...
2.0
>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print element.xpath('@*[%d]' % i).extract_first()
...
value1
value2
>>>
Now you can guess what we can do: call name()
for each @*[i]
:
>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print element.xpath('name(@*[%d])' % i).extract_first()
...
attr1
attr2
>>>
If you put all this together, and assume that @*
returns attributes in document order (not guaranteed by the XPath 1.0 specs as far as I can tell, but it's what I observe with lxml
), you end up with this:
>>> attributes = []
>>> for element in selector.xpath('//element'):
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         attributes.append((attribute_name, attribute.extract()))
...
>>> attributes
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>>
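For completeness, the same position-indexed name() technique works with plain lxml (the library Scrapy uses under the hood), bypassing Scrapy selectors entirely. This is a sketch assuming lxml is installed; the markup mirrors the example above:

```python
# Sketch of the same technique with plain lxml instead of Scrapy selectors.
from lxml import etree

root = etree.fromstring(
    '<html><element attr1="value1" attr2="value2">some text</element></html>')
attributes = []
for element in root.xpath('//element'):
    for index, value in enumerate(element.xpath('@*'), start=1):
        # name(@*[i]) evaluated on the element yields the i-th attribute's name
        name = element.xpath('name(@*[%d])' % index)
        attributes.append((name, str(value)))
print(attributes)  # [('attr1', 'value1'), ('attr2', 'value2')]
```

With lxml, element.xpath('@*') returns the attribute values as strings and name(@*[i]) returns a plain string, so no extract() calls are needed.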