
extracting text xpath scrapy

Tags: html, xpath, scrapy

Hi all, I would like to extract all the text from an HTML block using XPath in Scrapy.

Let's say we have a block like this:

<div>
   <p>Blahblah</p>
   <p><a>Bluhbluh</a></p>
   <p><a><span>Bliblih</span></a></p> 
</div>

I want to extract the text as ["Blahblah", "Bluhbluh", "Bliblih"]. I want XPath to recursively look for text in the div node. I have tried //div/p[descendant-or-self::*]/text(), but it does not extract text from nested elements.

Cheers! Seb

eaglefreeman asked Oct 10 '14 14:10




1 Answer

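In your expression, the predicate [descendant-or-self::*] only filters which p elements match; the trailing /text() still selects just the text nodes that are direct children of each p. You can instead use XPath's string() function on each p element, which concatenates the text of all its descendants: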

>>> import scrapy
>>> selector = scrapy.Selector(text="""<div>
...    <p>Blahblah</p>
...    <p><a>Bluhbluh</a></p>
...    <p><a><span>Bliblih</span></a></p> 
... </div>""")
>>> [p.xpath("string()").extract() for p in selector.xpath('//div/p')]
[[u'Blahblah'], [u'Bluhbluh'], [u'Bliblih']]
>>> import operator
>>> map(operator.itemgetter(0), [p.xpath("string()").extract() for p in selector.xpath('//div/p')])
[u'Blahblah', u'Bluhbluh', u'Bliblih']
paul trmbrth answered Sep 21 '22 16:09