
What XPath can I use to get all text nodes after and including the first paragraph node?

I'm new to Nokogiri, and Ruby in general.

I want to get the text of all the nodes in the document, starting from and inclusive of the first paragraph node.

I tried the following with XPath but I'm getting nowhere:

 puts page.search("//p[0]/text()[next-sibling::node()]")

This doesn't work. What do I have to change?

user1895623 asked Apr 07 '13 19:04


2 Answers

You have to find the first <p/> node and return all text() nodes, both inside it and following it. Depending on which XPath capabilities Nokogiri has, use one of these queries:

//p[1]/(descendant::text() | following::text())

If that doesn't work (a parenthesized step after "/" is XPath 2.0 syntax), use this instead. It has to find the first paragraph twice, so it may be slightly, though probably unnoticeably, slower:

(//p[1]/descendant::text() | //p[1]/following::text())

An XPath 2.0 alternative, which is probably unsupported, would be:

//text()[//p[1] << .]

which means "all text nodes preceded by the first <p/> node in the document".

Jens Erat answered Sep 28 '22 07:09


This works with Nokogiri (which is built on top of libxml2 and supports XPath 1.0 expressions):

//p[1]//text() | //p[1]/following::text()

Proof:

require 'nokogiri'

html = '<body><h1>A</h1><p>B <b>C</b></p><p>D <b>E</b></p></body>'
doc = Nokogiri.HTML(html)

p doc.xpath('//p[1]//text() | //p[1]/following::text()').map(&:text)
#=> ["B ", "C", "D ", "E"]

Note that selecting the text nodes themselves returns a NodeSet of Nokogiri::XML::Text objects, so if you want only their text contents you must map them with the .text (or .content) method.

Phrogz answered Sep 28 '22 08:09