I have some HTML that looks like:
<dt>
<a href="#">Hello</a>
(2009)
</dt>
I already have all my HTML loaded into a variable called record. I need to parse out the year, i.e. 2009, if it exists.
How can I get the text inside the dt tag but not the text inside the a tag? I've used record.search("dt").inner_text and this gives me everything.
It's a trivial question but I haven't managed to figure this out.
To get only the text nodes that are direct children of the element, and not any text nested inside its child elements, you can use XPath like so:
doc.xpath('//dt/text()')
Or if you wish to use search:
doc.search('dt').xpath('text()')
Using XPath to select exactly what you want (as suggested by @Casper) is the right answer.
def own_text(node)
  # Find the content of all child text nodes and join them together
  node.xpath('text()').text
end
Here's an alternative, fun answer :)
def own_text(node)
  # Deep-copy the node, remove its element children, and read the text that is left
  node.clone(1).tap { |copy| copy.element_children.remove }.text
end
Seen in action:
require 'nokogiri'
root = Nokogiri.XML('<r>hi <a>BOO</a> there</r>').root
puts root.text #=> hi BOO there
puts own_text(root) #=> hi  there
The year text is the last child node of the dt element (it comes after the a element), so you can access it with:
doc.search("dt").children.last.text