
Getting name of attributes with Scrapy XPATH

I'm trying to get both the keys and the values of the attributes of a tag in an XML file (using Scrapy and XPath).

The tag is something like:

<element attr1="value1" attr2="value2" ...>

I don't know the keys "attr1", "attr2" and so on in advance, and they can differ from one element to the next. I haven't figured out how to get both keys and values with XPath. Is there another good practice for doing that?

Facundo Fabre asked Dec 19 '22 23:12


1 Answer

Short version

>>> for element in selector.xpath('//element'):
...     attributes = []
...     # loop over all attribute nodes of the element
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         # use XPath's name() string function on each attribute,
...         # using their position
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         # Scrapy's extract() on an attribute returns its value
...         attributes.append((attribute_name, attribute.extract()))
... 
>>> attributes # list of (attribute name, attribute value) tuples
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>> 

Long version

XPath has a name(node-set?) function to get node names (an attribute is a node, an attribute node):

The name function returns a string containing a QName representing the expanded-name of the node in the argument node-set that is first in document order. (...) If the argument is omitted, it defaults to a node-set with the context node as its only member.

(source: http://www.w3.org/TR/xpath/#function-name)

>>> import scrapy
>>> selector = scrapy.Selector(text='''
...     <html>
...     <element attr1="value1" attr2="value2">some text</element>
...     </html>''')
>>> selector.xpath('//element').xpath('name()').extract()
[u'element']

(Here, I chained name() on the result of the //element selection to apply the function to every selected element node. This chaining is a handy feature of Scrapy selectors.)

One would like to do the same with attribute nodes, right? But it does not work:

>>> selector.xpath('//element/@*').extract()
[u'value1', u'value2']
>>> selector.xpath('//element/@*').xpath('name()').extract()
[]
>>> 

Note: I don't know if it's a limitation of lxml/libxml2, which Scrapy uses under the hood, or if the XPath specs disallow it. (I don't see why it would.)
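
One possible explanation, offered as an assumption rather than something verified against Scrapy's source: lxml hands back attribute matches as strings rather than node objects, so there is no node left to evaluate name() against. With lxml used directly (not through Scrapy selectors), those strings are "smart strings" that still remember where they came from. A small Python 3 sketch, assuming lxml is installed and its default smart_strings behaviour is in effect:

```python
from lxml import etree

root = etree.fromstring(
    '<root><element attr1="value1" attr2="value2">some text</element></root>')

# Attribute matches are returned as str subclasses, not as node objects
results = root.xpath('//element/@*')
print(all(isinstance(r, str) for r in results))

# lxml "smart strings" expose the attribute name and the owning element
pairs = [(r.attrname, str(r)) for r in results]
parents = [r.getparent().tag for r in results]
print(pairs)
print(parents)
```

This only illustrates lxml's behaviour; through Scrapy selectors you get plain values, which is why the rest of this answer goes through the parent element.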

What you can do, though, is use the name(node-set) form, i.e. with a non-empty node-set as the argument. If you carefully read the part of the XPath 1.0 specs I pasted above, you'll see that, as with other string functions, name(node-set) only takes into account the first node in the node-set (in document order):

>>> selector.xpath('//element').xpath('@*').extract()
[u'value1', u'value2']
>>> selector.xpath('//element').xpath('name(@*)').extract()
[u'attr1']
>>> 

Attribute nodes also have positions, so you can loop on all attributes by their position. Here we have 2 (result of count(@*) on the context node):

>>> for element in selector.xpath('//element'):
...     print(element.xpath('count(@*)').extract_first())
... 
2.0
>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print(element.xpath('@*[%d]' % i).extract_first())
... 
value1
value2
>>> 

Now you can guess the next step: call name() on each @*[i]:

>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print(element.xpath('name(@*[%d])' % i).extract_first())
... 
attr1
attr2
>>> 

If you put all this together, and assume that @* returns attributes in document order (I don't think the XPath 1.0 specs guarantee this, but it's what I see happening with lxml), you end up with this:

>>> attributes = []
>>> for element in selector.xpath('//element'):
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         attributes.append((attribute_name, attribute.extract()))
... 
>>> attributes
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>> 

paul trmbrth answered Jan 01 '23 08:01