Hi all, I would like to extract all the text from an HTML block using XPath in Scrapy.
Let's say we have a block like this:
<div>
<p>Blahblah</p>
<p><a>Bluhbluh</a></p>
<p><a><span>Bliblih</span></a></p>
</div>
I want to extract the text as ["Blahblah","Bluhbluh","Bliblih"]. I want the XPath to recursively look for text in the div node.
I have tried: //div/p[descendant-or-self::*]/text()
but it does not extract the text of nested elements.
Cheers! Seb
When you pass text to an XPath string function, use . (dot) instead of .//text(). The expression .//text() produces a collection of text nodes, called a node-set, and a string function converts a node-set to a string using only its first node.
We use response.css() to select all a elements with the class title. Then ::attr(href) selects the href attribute of each selected element, and getall() returns all of the attribute values.
You can use XPath's string() function on each p element:
>>> import scrapy
>>> selector = scrapy.Selector(text="""<div>
... <p>Blahblah</p>
... <p><a>Bluhbluh</a></p>
... <p><a><span>Bliblih</span></a></p>
... </div>""")
>>> # string() returns the concatenated text of each p, as a one-item list per element
>>> [p.xpath("string()").extract() for p in selector.xpath('//div/p')]
[['Blahblah'], ['Bluhbluh'], ['Bliblih']]
>>> import operator
>>> # Take the first item of each list; list() is needed on Python 3, where map() is lazy
>>> list(map(operator.itemgetter(0), [p.xpath("string()").extract() for p in selector.xpath('//div/p')]))
['Blahblah', 'Bluhbluh', 'Bliblih']