
Scrapy: get the entire text of an element, including its children

I have a series of <p> elements inside a document I'm scraping with Scrapy.
Some of them are: <p><span>bla bla bla</span></p> or <p><span><span>bla bla bla</span><span>second bla bla</span></span></p>

I want to extract all the text, including the text of the children (assume I already have the selector for the <p>).
For the second example, the result should be the string: bla bla bla second bla bla

Boaz asked Oct 25 '14 16:10



2 Answers

Here are two options; each has its benefits depending on the situation.

HTML sample:

<p>Something outside the span<span> and something inside the span</span></p>

Option 01: use //text() -> returns a list with one entry per text node

response.xpath('//p//text()').getall()

# returns
>>> ['Something outside the span', ' and something inside the span']

Option 02: use string() -> returns a single string (the concatenated text of the first matching <p>)

response.xpath('string(//p)').get()

# returns
>>> 'Something outside the span and something inside the span'
Brian Lynch answered Nov 08 '22 15:11



You can just use //text() to extract all the text from descendant nodes.

For example:

.//p//text()
Anzel answered Nov 08 '22 15:11
