I use Nokogiri to parse an HTML page with content like this:
<p class="parent">
Useful text
<br>
<span class="child">Useless text</span>
</p>
When I call page.css('p.parent').text, Nokogiri returns 'Useful text Useless text', but I need only 'Useful text'.
How can I get a node's own text without the text of its children?
The text nodes of an element can be selected with jQuery in three steps. The required element is first selected using a jQuery selector. The contents() method is then called on the selection to get all of its child nodes, including text nodes. Finally, the filter() method checks each node's nodeType property and keeps only the text nodes (nodeType 3).
To get all child nodes of an element, you can use the childNodes property. This property returns a collection of a node's child nodes as a NodeList object. The nodes in the collection are ordered by their appearance in the source code, and you can use a numerical index (starting from 0) to access individual nodes.
A text node encapsulates XML character content. A text node can have zero or one parent. The content of a text node can be empty. However, unless the parent of a text node is empty, the content of the text node cannot be an empty string.
XPath includes the text() node test for selecting text nodes, so you could do:
page.xpath('//p[@class="parent"]/text()')
Using XPath to select HTML classes can become quite tricky if the element in question could belong to more than one class, so this might not be ideal.
Fortunately Nokogiri adds the text() selector to CSS, so you can use:
page.css('p.parent > text()')
to get the text nodes that are direct children of p.parent. This will also return some nodes that are whitespace only, so you may have to filter them out.