
Using XPath in Scrapy to select any text below a paragraph

My initial code works, but it misses some text because of irregular formatting on the site:

response.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract()


  <div id="body">
  <a name="main_content" id="main_content"></a>
  <!-- InstanceBeginEditable name="main_content" -->
<div class="return_to_div"><a href="../../index.html">HOME</a>  | <a href="../index.html">DEATH ROW</a>  | <a href="index.html">INFORMATION</a>  | text</div>
<h1>text</h1>
<h2>text</h2>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">Description:</p>
<p>Line1</p>
<p>Line2</p>
Line3  <!-- InstanceEndEditable -->  
  </div>

I have no problem pulling Line1 and Line2. Line3, however, is not a sibling element of my <p class="text_bold"> paragraph. This only occurs on some of the pages I am trying to scrape from a table.

Here is the link: https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html

Sorry, XPath just confuses me. Is there a way to extract all the data that comes after the match for //*[contains(., 'Description:')], rather than requiring it to be a sibling?

Thanks in advance.

Edited: changed the example to better reflect the actual page and added a link to the original page.

BernardL asked May 19 '16 09:05


1 Answer

You can select all sibling nodes (elements and text nodes) after that <p> containing "Description:" (following-sibling::node()) and then fetch all text nodes (descendant-or-self::text()):

>>> import scrapy
>>> response = scrapy.Selector(text="""<div>
...  <p> Name </p>
...  <p> Age  </p>
...  <p class="text-bold"> Description: </p>
...  <p> Line 1 </p>
...  <p> Line 2 </p>
... Line 3
... </div>""", type="html")
>>> response.xpath("""//div/p[contains(., 'Description:')]
...      /following-sibling::node()
...         /descendant-or-self::text()""").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
>>> 
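The result above still contains whitespace-only strings and padded text. A minimal post-processing sketch in plain Python follows; the helper name clean_text_nodes is hypothetical (it is not part of Scrapy), and the sample list is copied from the session output above:

```python
def clean_text_nodes(strings):
    """Strip each extracted string and drop the ones that were only whitespace."""
    stripped = (s.strip() for s in strings)
    return [s for s in stripped if s]


# Sample data copied from the session output above (hypothetical variable name).
extracted = ['\n ', ' Line 1 ', '\n ', ' Line 2 ', '\nLine 3\n']

lines = clean_text_nodes(extracted)
print(lines)  # ['Line 1', 'Line 2', 'Line 3']
```

In a spider, the same helper could be applied to whatever .extract() returns before the lines are joined or stored.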

Let's break it down.

So, you already know how to locate the correct <p> containing "Description" (with XPath //div/p[contains(., 'Description:')]):

>>> response.xpath("//div/p[contains(., 'Description:')]").extract()
[u'<p class="text-bold"> Description: </p>']

You want <p>s that come after (following-sibling:: axis + p element selection):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::p").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']

This doesn't give you the 3rd line. So you read about XPath and try the "catch-all" *:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::*").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']

Still no luck. Why? Because * only selects elements (commonly referred to as "tags", to simplify).

The 3rd line you're after is a text node, child of the parent <div> element. But a text node is also a node (!) so you can select it as a sibling of that famous <p> above:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()").extract()
[u'\n ', u'<p> Line 1 </p>', u'\n ', u'<p> Line 2 </p>', u'\nLine 3\n']
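To see that "Line 3" really is loose text rather than an element, here is a small sketch using Python's standard-library ElementTree instead of Scrapy's selectors (a swapped-in tool, used only because it needs no installation). It assumes a well-formed XML version of the snippet. ElementTree does not model text nodes as siblings; it attaches text that follows an element's end tag to that element's tail attribute:

```python
import xml.etree.ElementTree as ET

# Well-formed XML version of the example markup (whitespace between tags removed).
doc = ET.fromstring(
    '<div>'
    '<p class="text-bold"> Description: </p>'
    '<p> Line 1 </p>'
    '<p> Line 2 </p>'
    'Line 3'
    '</div>'
)

children = list(doc)
print([child.tag for child in children])  # ['p', 'p', 'p'] (only elements are children here)

last_p = children[-1]
print(repr(last_p.text))  # ' Line 2 ' (text inside the last <p>)
print(repr(last_p.tail))  # 'Line 3'   (text after </p>, still inside <div>)
```

In the XPath data model that same trailing text is a text node child of the <div>, which is why following-sibling::node() can reach it while following-sibling::* cannot.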

OK, so now it seems we have the nodes we want ("tag" elements and text nodes). But you still get those "<p>" in the output of .extract() (the XPath selected the elements, not their "inner" text).

So you read more about XPath and use the //text() step (roughly "all descendant text nodes from here"):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()//text()").extract()
[u' Line 1 ', u' Line 2 ']

Err, wait, where did the 3rd line go?

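In fact, // is short for /descendant-or-self::node()/, so the step expands to descendant-or-self::node()/text(), which selects text nodes that are children of the context nodes or of their descendants. Here that means only the text inside the following <p> elements: a text node has no children, so a sibling text node such as "Line 3" contributes nothing (self::text()/text() never matches anything):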

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::node()/text()").extract()
[u' Line 1 ', u' Line 2 ']

What you can do here is use the handy descendant-or-self axis together with the text() node test: if following-sibling::node() lands on a text node, the "self" in descendant-or-self matches that text node, and the text() node test will be true:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::text()").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
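The difference between the last two expressions can be mimicked with a toy tree in plain Python. This is only an illustrative model, not how lxml or Scrapy evaluate XPath; the Element class and both function names are hypothetical, and a plain str stands in for a text node:

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    tag: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Element, str]  # a plain str stands in for a text node


def descendant_or_self(node):
    """Yield the node itself, then all of its descendants in document order."""
    yield node
    if isinstance(node, Element):
        for child in node.children:
            yield from descendant_or_self(child)


def double_slash_text(context):
    """Model of context//text(), i.e. descendant-or-self::node()/child::text()."""
    found = []
    for start in context:
        for node in descendant_or_self(start):
            if isinstance(node, Element):  # text nodes have no children
                found.extend(c for c in node.children if isinstance(c, str))
    return found


def descendant_or_self_text(context):
    """Model of context/descendant-or-self::text()."""
    return [n for start in context
            for n in descendant_or_self(start) if isinstance(n, str)]


# What following-sibling::node() returned in the example above.
siblings = ["\n ", Element("p", [" Line 1 "]), "\n ",
            Element("p", [" Line 2 "]), "\nLine 3\n"]

print(double_slash_text(siblings))        # [' Line 1 ', ' Line 2 ']
print(descendant_or_self_text(siblings))  # ['\n ', ' Line 1 ', '\n ', ' Line 2 ', '\nLine 3\n']
```

The first function only ever looks at children of nodes, so a bare text sibling is skipped; the second tests each node itself, so the bare text is kept.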

Using the example URL from OP's edited question:

$ scrapy shell https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
2016-05-19 13:14:44 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-05-19 13:14:44 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
(...)
2016-05-19 13:14:48 [scrapy] INFO: Spider opened
2016-05-19 13:14:50 [scrapy] DEBUG: Crawled (200) <GET https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html> (referer: None)

>>> t = response.xpath("""
...     //div/p[contains(., 'Last Statement:')]
...         /following-sibling::node()
...             /descendant-or-self::text()""").extract()
>>> 
>>> 
>>> print(''.join(t))

I would like to thank everyone that has showed up on my  behalf, Kathryn Cox, I love you dearly.  Thank you Randy Cannon for  showing up and being a lifelong friend.  Thank you Dr. Steve Ball for  trying to bring the right out.  There are a lot of injustices that are  happening with this.  This is wrong.  Thank you Reverend Leon  Harrison for showing me the grace of God.  Thank you for all of my friends  that are out there.  This is not a capital case.  I never had  intended to do anything.  I feel very grieved for the loss of Walker, and  for Donovan and Marissa Walker.  I hope they can find peace and be  productive in society.  I would like to thank all of my friends on the row  even though everything didn’t work, close isn’t good enough.  I hope that  positive change will come out of this.
I would like to thank my father and mother for everything  that they showed me.  I would like to apologize for putting them through  this.  I would like to ask for the truth to come out and make positive  changes.  Above all else Donovan and Marissa can find love and  peace.  I hope they overcome the loss of their father.  At no time  did I intend to hurt him.
When  the truth comes out I hope that they can find closure.  There are a lot of  things that are not right in this world, I have had to overcome them  myself.  I hope all that are on the row, I hope they find peace and solace  in their life. Everyone can find peace in a Christian God or whatever God they  believe in.  I thank you mom and dad for everything, I love you  dearly.  One last thing, I thank all of my friends that showed loyalty and  graced my life with more positive.  I would also like to thank Gustav’s  mother for having such a great son, and showing me much love.  I have met  good people on the row, not all of them are bad.  I hope everyone can see  that.  I just want to thank everybody that came to witness this.  I  thank everyone, I am sorry things didn’t work out.  May God forgive us  all?  I am sorry mother and I am sorry father.  I hope you find peace  and solace in your heart.  I know there is something else I need to  say.  I feel that.    
paul trmbrth answered Sep 27 '22 23:09