Well, my initial code works but misses some of the weird formatting on the site:
response.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract()
<div id="body">
<a name="main_content" id="main_content"></a>
<!-- InstanceBeginEditable name="main_content" -->
<div class="return_to_div"><a href="../../index.html">HOME</a> | <a href="../index.html">DEATH ROW</a> | <a href="index.html">INFORMATION</a> | text</div>
<h1>text</h1>
<h2>text</h2>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">Description:</p>
<p>Line1</p>
<p>Line2</p>
Line3 <!-- InstanceEndEditable -->
</div>
I have no problem pulling Line 1 and Line 2. Line 3, however, is not a sibling of my <p> element. This only occurs on some of the pages I am trying to scrape from a table.
Here is the link: https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
Sorry, XPath just confuses me. Is there a way to extract all data that comes after the node matching //*[contains(., 'Description:')], rather than it having to be a sibling?
Thanks in advance.
Edited: Changed the example to better reflect the actual page. Added a link to the original page.
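For reference, the problem can be reproduced outside the spider with a bare Selector on the simplified snippet (a quick sketch; the result below is what you should expect, with Line3 missing because only <p> siblings are selected):
>>> import scrapy
>>> sel = scrapy.Selector(text="""<div id="body">
... <p class="text_bold">Description:</p>
... <p>Line1</p>
... <p>Line2</p>
... Line3
... </div>""", type="html")
>>> # same XPath as above: only <p> siblings are matched, so Line3 is lost
>>> sel.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract()
[u'Line1', u'Line2']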
You can select all sibling nodes (elements and text nodes) after that <p> containing "Description:" (following-sibling::node()) and then fetch all text nodes (descendant-or-self::text()):
>>> import scrapy
>>> response = scrapy.Selector(text="""<div>
... <p> Name </p>
... <p> Age </p>
... <p class="text-bold"> Description: </p>
... <p> Line 1 </p>
... <p> Line 2 </p>
... Line 3
... </div>""", type="html")
>>> response.xpath("""//div/p[contains(., 'Description:')]
... /following-sibling::node()
... /descendant-or-self::text()""").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
>>>
Let's break it down.
So, you already know how to locate the correct <p> containing "Description" (with the XPath //div/p[contains(., 'Description:')]):
>>> response.xpath("//div/p[contains(., 'Description:')]").extract()
[u'<p class="text-bold"> Description: </p>']
You want the <p>s that come after it (the following-sibling:: axis + the p element selection):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::p").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']
This doesn't give you the 3rd line. So you read about XPath and try the "catch-all" *:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::*").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']
Still no luck. Why? Because * only selects elements (commonly referred to as "tags", to simplify). The 3rd line you're after is a text node, a child of the parent <div> element. But a text node is also a node (!), so you can select it as a sibling of that famous <p> above:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()").extract()
[u'\n ', u'<p> Line 1 </p>', u'\n ', u'<p> Line 2 </p>', u'\nLine 3\n']
OK, so now it seems we have the nodes we want ("tag" elements and text nodes). But you still get those "<p>" tags in the output of .extract() (the XPath selected the elements, not their "inner" text). So you read some more about XPath and use the .//text() step (roughly "all descendant text nodes from here"):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()//text()").extract()
[u' Line 1 ', u' Line 2 ']
Err, wait, where did the 3rd line go? In fact, this // is short for /descendant-or-self::node()/, so ./descendant-or-self::node()/text() only selects text nodes that are children of those following <p> elements (text nodes don't have children, so when "self" is a text node the final text() step never matches anything):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::node()/text()").extract()
[u' Line 1 ', u' Line 2 ']
What you can do here is use the handy descendant-or-self axis + the text() node test, so that if following-sibling::node() landed on a text node, the "self" in descendant-or-self matches that text node, and the text() node test will be true:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::text()").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
Using the example URL from OP's edited question:
$ scrapy shell https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
2016-05-19 13:14:44 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-05-19 13:14:44 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
(...)
2016-05-19 13:14:48 [scrapy] INFO: Spider opened
2016-05-19 13:14:50 [scrapy] DEBUG: Crawled (200) <GET https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html> (referer: None)
>>> t = response.xpath("""
... //div/p[contains(., 'Last Statement:')]
... /following-sibling::node()
... /descendant-or-self::text()""").extract()
>>>
>>>
>>> print(''.join(t))
I would like to thank everyone that has showed up on my behalf, Kathryn Cox, I love you dearly. Thank you Randy Cannon for showing up and being a lifelong friend. Thank you Dr. Steve Ball for trying to bring the right out. There are a lot of injustices that are happening with this. This is wrong. Thank you Reverend Leon Harrison for showing me the grace of God. Thank you for all of my friends that are out there. This is not a capital case. I never had intended to do anything. I feel very grieved for the loss of Walker, and for Donovan and Marissa Walker. I hope they can find peace and be productive in society. I would like to thank all of my friends on the row even though everything didn’t work, close isn’t good enough. I hope that positive change will come out of this.
I would like to thank my father and mother for everything that they showed me. I would like to apologize for putting them through this. I would like to ask for the truth to come out and make positive changes. Above all else Donovan and Marissa can find love and peace. I hope they overcome the loss of their father. At no time did I intend to hurt him.
When the truth comes out I hope that they can find closure. There are a lot of things that are not right in this world, I have had to overcome them myself. I hope all that are on the row, I hope they find peace and solace in their life. Everyone can find peace in a Christian God or whatever God they believe in. I thank you mom and dad for everything, I love you dearly. One last thing, I thank all of my friends that showed loyalty and graced my life with more positive. I would also like to thank Gustav’s mother for having such a great son, and showing me much love. I have met good people on the row, not all of them are bad. I hope everyone can see that. I just want to thank everybody that came to witness this. I thank everyone, I am sorry things didn’t work out. May God forgive us all? I am sorry mother and I am sorry father. I hope you find peace and solace in your heart. I know there is something else I need to say. I feel that.
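If you want to fold this into a spider rather than the shell, a minimal sketch could look like the following (the spider name, class name and item key are made up for illustration; only the XPath comes from the answer above). It joins the extracted text nodes and strips the stray whitespace:
import scrapy

class LastStatementSpider(scrapy.Spider):
    # hypothetical spider name and item key, for illustration only
    name = "last_statement"
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html",
    ]

    def parse(self, response):
        # same XPath as in the shell session above
        parts = response.xpath(
            "//div/p[contains(., 'Last Statement:')]"
            "/following-sibling::node()"
            "/descendant-or-self::text()"
        ).extract()
        # join the text nodes and collapse the stray whitespace
        statement = " ".join(p.strip() for p in parts if p.strip())
        yield {"last_statement": statement}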