Using Xpath in Scrapy to select any text below paragraph

Tags: python, web-scraping, xpath, scrapy, scrapy-spider

My initial code works, but it misses some oddly formatted text on the site:

response.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract()


  <div id="body">
  <a name="main_content" id="main_content"></a>
  <!-- InstanceBeginEditable name="main_content" -->
<div class="return_to_div"><a href="../../index.html">HOME</a>  | <a href="../index.html">DEATH ROW</a>  | <a href="index.html">INFORMATION</a>  | text</div>
<h1>text</h1>
<h2>text</h2>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">Description:</p>
<p>Line1</p>
<p>Line2</p>
Line3  <!-- InstanceEndEditable -->  
  </div>

I have no problem pulling Line 1 and Line 2. Line 3, however, is not inside a sibling <p> element. This only occurs on some of the pages I am trying to scrape from a table.

Here is the link: https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html

Sorry, XPath just confuses me. Is there a way to extract all the data that comes after the criteria //*[contains(., 'Description:')], rather than requiring it to be a sibling?

Thanks in advance.

Edited: Changed the example to better reflect the actual page. Added a link to the original page.

asked May 19 '16 09:05

BernardL

1 Answer

You can select all sibling nodes (elements and text nodes) after the <p> containing "Description:" (following-sibling::node()) and then fetch all text nodes (descendant-or-self::text()):

>>> import scrapy
>>> response = scrapy.Selector(text="""<div>
...  <p> Name </p>
...  <p> Age  </p>
...  <p class="text-bold"> Description: </p>
...  <p> Line 1 </p>
...  <p> Line 2 </p>
... Line 3
... </div>""", type="html")
>>> response.xpath("""//div/p[contains(., 'Description:')]
...      /following-sibling::node()
...         /descendant-or-self::text()""").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
>>>

Let's break it down.

So, you already know how to locate the correct <p> containing "Description:" (with XPath //div/p[contains(., 'Description:')]):

>>> response.xpath("//div/p[contains(., 'Description:')]").extract()
[u'<p class="text-bold"> Description: </p>']

You want the <p>s that come after it (following-sibling:: axis + p element selection):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::p").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']

This doesn't give you the 3rd line. So you read about XPath and try the "catch-all" *:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::*").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']

Still no luck. Why? Because * only selects elements (commonly referred to as "tags", to simplify).

The 3rd line you're after is a text node, a child of the parent <div> element. But a text node is also a node (!), so you can select it as a sibling of that famous <p> above:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()").extract()
[u'\n ', u'<p> Line 1 </p>', u'\n ', u'<p> Line 2 </p>', u'\nLine 3\n']
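
To see why * misses the third line, it can help to look at the same structure in a tree API that keeps elements and loose text apart. The sketch below uses Python's standard-library xml.etree.ElementTree instead of Scrapy (a swap made only so it runs without extra packages), on a simplified, well-formed copy of the sample markup. In that API, iterating over the <div> yields only its child elements, and the loose "Line 3" text is stored as the .tail of the last <p>:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the sample markup (an assumption,
# not the real page).
root = ET.fromstring(
    "<div><p>Description:</p><p>Line 1</p><p>Line 2</p>Line 3</div>"
)

# Iterating over an element yields child *elements* only.
child_tags = [child.tag for child in root]

# Text that follows an element inside its parent is kept as that element's tail.
last_tail = root[-1].tail

print(child_tags)
print(repr(last_tail))
```

This mirrors the XPath situation: the text exists in the tree, but it is not an element, so an element-only selection never returns it.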

OK, so now it seems we have the nodes we want ("tag" elements and text nodes). But you still get those "<p>" tags in the output of .extract() (the XPath selected the elements, not their "inner" text).

So you read more about XPath and use the //text() step (roughly "all descendant text nodes from here"):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()//text()").extract()
[u' Line 1 ', u' Line 2 ']

Err, wait, where did the 3rd line go?

In fact, this // is short for /descendant-or-self::node()/, so ./descendant-or-self::node()/text() selects only text nodes that are children of the selected nodes or of their descendants. Here that means only the text inside the following <p> elements (text nodes don't have children, so self::text()/text() will never match any text node):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::node()/text()").extract()
[u' Line 1 ', u' Line 2 ']

What you can do here is use the handy descendant-or-self axis with the text() node test: if following-sibling::node() reaches a text node, the "self" in descendant-or-self matches that text node, and the text() node test will be true:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::text()").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
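
For readers who want to see what following-sibling::node()/descendant-or-self::text() amounts to, here is a rough standard-library sketch of the same walk. It is not Scrapy code: it uses xml.etree.ElementTree, whose limited XPath subset does not, as far as I know, offer a following-sibling axis, so the walk is written as a loop. The SAMPLE string and the text_after helper are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page fragment in the question.
SAMPLE = (
    "<div>"
    "<p>Name</p>"
    "<p class='text-bold'>Description:</p>"
    "<p>Line 1</p>"
    "<p>Line 2</p>"
    "Line 3"
    "</div>"
)


def text_after(parent, marker):
    """Collect non-blank text found after the child element containing `marker`."""
    pieces = []
    found = False
    for child in parent:
        if found:
            # Text inside a following sibling element (its descendant text).
            pieces.extend(child.itertext())
        elif marker in "".join(child.itertext()):
            found = True
        if found and child.tail:
            # Text sitting directly in the parent after this element:
            # a sibling text node in XPath terms.
            pieces.append(child.tail)
    # Drop whitespace-only pieces and trim the rest.
    return [piece.strip() for piece in pieces if piece.strip()]


lines = text_after(ET.fromstring(SAMPLE), "Description:")
print(lines)
```

The loop handles both cases the XPath handles: text wrapped in a following <p>, and bare text that sits next to it in the parent.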

Using the example URL from OP's edited question:

$ scrapy shell https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
2016-05-19 13:14:44 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-05-19 13:14:44 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
(...)
2016-05-19 13:14:48 [scrapy] INFO: Spider opened
2016-05-19 13:14:50 [scrapy] DEBUG: Crawled (200) <GET https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html> (referer: None)

>>> t = response.xpath("""
...     //div/p[contains(., 'Last Statement:')]
...         /following-sibling::node()
...             /descendant-or-self::text()""").extract()
>>> 
>>> 
>>> print(''.join(t))

I would like to thank everyone that has showed up on my  behalf, Kathryn Cox, I love you dearly.  Thank you Randy Cannon for  showing up and being a lifelong friend.  Thank you Dr. Steve Ball for  trying to bring the right out.  There are a lot of injustices that are  happening with this.  This is wrong.  Thank you Reverend Leon  Harrison for showing me the grace of God.  Thank you for all of my friends  that are out there.  This is not a capital case.  I never had  intended to do anything.  I feel very grieved for the loss of Walker, and  for Donovan and Marissa Walker.  I hope they can find peace and be  productive in society.  I would like to thank all of my friends on the row  even though everything didn’t work, close isn’t good enough.  I hope that  positive change will come out of this.
I would like to thank my father and mother for everything  that they showed me.  I would like to apologize for putting them through  this.  I would like to ask for the truth to come out and make positive  changes.  Above all else Donovan and Marissa can find love and  peace.  I hope they overcome the loss of their father.  At no time  did I intend to hurt him.
When  the truth comes out I hope that they can find closure.  There are a lot of  things that are not right in this world, I have had to overcome them  myself.  I hope all that are on the row, I hope they find peace and solace  in their life. Everyone can find peace in a Christian God or whatever God they  believe in.  I thank you mom and dad for everything, I love you  dearly.  One last thing, I thank all of my friends that showed loyalty and  graced my life with more positive.  I would also like to thank Gustav’s  mother for having such a great son, and showing me much love.  I have met  good people on the row, not all of them are bad.  I hope everyone can see  that.  I just want to thank everybody that came to witness this.  I  thank everyone, I am sorry things didn’t work out.  May God forgive us  all?  I am sorry mother and I am sorry father.  I hope you find peace  and solace in your heart.  I know there is something else I need to  say.  I feel that.
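
The joined output above still carries the page's line breaks and doubled spaces. If a single tidy string is wanted, one possible post-processing step is to collapse whitespace with plain Python. The tidy helper below is a hypothetical name, and the sample list is the one from the simplified example earlier, not the real page:

```python
def tidy(text_nodes):
    """Join extracted text nodes and collapse every run of whitespace to one space."""
    return " ".join("".join(text_nodes).split())


# Same shape as the list returned by .extract() in the simplified example.
sample = ["\n ", " Line 1 ", "\n ", " Line 2 ", "\nLine 3\n"]
cleaned = tidy(sample)
print(cleaned)
```

Note that this also removes paragraph breaks, so it only suits cases where the text is wanted as one line.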

answered Sep 27 '22 23:09

paul trmbrth
