I'm new to Nokogiri, and Ruby in general.
I want to get the text of all the nodes in the document, starting from and inclusive of the first paragraph node.
I tried the following with XPath but I'm getting nowhere:
puts page.search("//p[0]/text()[next-sibling::node()]")
This doesn't work. What do I have to change?
You have to find the <p/> node and return all text() nodes, both inside it and following it. Depending on what XPath capabilities Nokogiri has, use one of these queries:
//p[1]/(descendant::text() | following::text())
If that doesn't work, use this instead. It has to find the first paragraph twice, so it may be slightly (though probably unnoticeably) slower:
(//p[1]/descendant::text() | //p[1]/following::text())
A probably unsupported XPath 2.0 alternative would be:
//text()[//p[1] << .]
which means "all text nodes preceded by the first <p/> node in the document".
This works with Nokogiri (which is built on top of libxml2 and supports XPath 1.0 expressions):
//p[1]//text() | //p[1]/following::text()
Proof:
require 'nokogiri'
html = '<body><h1>A</h1><p>B <b>C</b></p><p>D <b>E</b></p></body>'
doc = Nokogiri.HTML(html)
p doc.xpath('//p[1]//text() | //p[1]/following::text()').map(&:text)
#=> ["B ", "C", "D ", "E"]
Note that selecting the text nodes themselves returns a NodeSet of Nokogiri::XML::Text objects, so if you want only their text contents you must map them via the .text (or .content) method.