
extracting text xpath scrapy

Tags:

html

xpath

scrapy

Hi all, I would like to extract all the text from an HTML block using XPath in Scrapy.

Let's say we have a block like this:

<div>
   <p>Blahblah</p>
   <p><a>Bluhbluh</a></p>
   <p><a><span>Bliblih</span></a></p> 
</div>

I want to extract the text as ["Blahblah", "Bluhbluh", "Bliblih"]. I want XPath to look recursively for text inside the div node. I have tried //div/p[descendant-or-self::*]/text(), but it does not extract the text of nested elements.

Cheers! Seb


asked Oct 10 '14 14:10

eaglefreeman

1 Answer

You can use XPath's string() function on each p element. It returns the concatenated text of the element and all of its descendants:

>>> import scrapy
>>> selector = scrapy.Selector(text="""<div>
...    <p>Blahblah</p>
...    <p><a>Bluhbluh</a></p>
...    <p><a><span>Bliblih</span></a></p> 
... </div>""")
>>> [p.xpath("string()").extract() for p in selector.xpath('//div/p')]
[[u'Blahblah'], [u'Bluhbluh'], [u'Bliblih']]
>>> import operator
>>> map(operator.itemgetter(0), [p.xpath("string()").extract() for p in selector.xpath('//div/p')])
[u'Blahblah', u'Bluhbluh', u'Bliblih']
>>>

answered Sep 21 '22 16:09

paul trmbrth
