When parsing indented XML, non-significant whitespace text nodes are created from the white space between a closing tag and the next opening tag. For example, from the following XML:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
whose string representation is as follows,
"<note>\n <to>Tove</to>\n <from>Jani</from>\n <heading>Reminder</heading>\n <body>Don't forget me this weekend!</body>\n</note>\n"
the following Document is created:
#(Document:0x3fc07e4540d8 {
name = "document",
children = [
#(Element:0x3fc07ec8629c {
name = "note",
children = [
#(Text "\n "),
#(Element:0x3fc07ec8089c {
name = "to",
children = [ #(Text "Tove")]
}),
#(Text "\n "),
#(Element:0x3fc07e8d8064 {
name = "from",
children = [ #(Text "Jani")]
}),
#(Text "\n "),
#(Element:0x3fc07e8d588c {
name = "heading",
children = [ #(Text "Reminder")]
}),
#(Text "\n "),
#(Element:0x3fc07e8cf590 {
name = "body",
children = [ #(Text "Don't forget me this weekend!")]
}),
#(Text "\n")]
})]
})
Here, there are lots of whitespace nodes of type Nokogiri::XML::Text.
I would like to count the children of each node in a Nokogiri XML Document, and access the first or last child, excluding non-significant white space. I do not want to parse those nodes, or to have to distinguish them from significant text nodes such as "Tove" inside the element <to>. Here is an rspec of what I am looking for:
require 'nokogiri'
require_relative 'spec_helper'
xml_text = <<XML
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
XML
xml = Nokogiri::XML(xml_text)
def significant_nodes(node)
return 0
end
describe "Stackoverflow Question" do
it "should return the number of significant nodes in nokogiri." do
expect(significant_nodes(xml.css('note'))).to eq 4
end
end
I want to know how to create the significant_nodes function.
If I change the XML to:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<footer></footer>
</note>
then when I create the Document, I still would like the footer represented; using config.noblanks is not an option.
You can use the NOBLANKS option when parsing the XML string. Consider this example:
require 'nokogiri'
string = "<foo>\n <bar>bar</bar>\n</foo>"
puts string
# <foo>
# <bar>bar</bar>
# </foo>
document_with_blanks = Nokogiri::XML.parse(string)
document_without_blanks = Nokogiri::XML.parse(string) do |config|
config.noblanks
end
document_with_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Text:0x3ffa4e153dac "\n ">
#<Nokogiri::XML::Element:0x3fdce3f78488 name="bar" children=[#<Nokogiri::XML::Text:0x3fdce3f781f4 "bar">]>
#<Nokogiri::XML::Text:0x3ffa4e15335c "\n">
document_without_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3f81bef42034 name="bar" children=[#<Nokogiri::XML::Text:0x3f81bef43ee8 "bar">]>
The NOBLANKS option shouldn't remove empty nodes:
doc = Nokogiri.XML('<foo><bar></bar></foo>') do |config|
config.noblanks
end
doc.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3fad0fafbfa8 name="bar">
As the OP pointed out, the documentation on the Nokogiri website (and also on the libxml website) about the parser options is quite cryptic. The following is a specification of the behaviour of the NOBLANKS option:
require 'rspec/autorun'
require 'nokogiri'
def parse_xml(xml_string)
Nokogiri.XML(xml_string) { |config| config.noblanks }
end
describe "Nokogiri NOBLANKS parser option" do
it "removes whitespace nodes if they have siblings" do
doc = parse_xml("<root>\n <child></child></root>")
expect(doc.root.children.size).to eq(1)
expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Element)
end
it "doesn't remove whitespaces nodes if they have no siblings" do
doc = parse_xml("<root>\n </root>")
expect(doc.root.children.size).to eq(1)
expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Text)
end
it "doesn't remove empty nodes" do
doc = parse_xml('<root><child></child></root>')
expect(doc.root.children.size).to eq(1)
expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Element)
end
end
You can create a query that only returns element nodes and ignores text nodes. In XPath, * only returns elements, so the query could look like this (querying the whole doc):
doc.xpath('//note/*')
or if you want to use CSS:
doc.css('note > *')
If you want to implement your significant_nodes method, you would need to make the query relative to the node passed in:
def significant_nodes(node)
node.xpath('./*').size
end
I don’t know how to do a relative query with CSS, so you might need to stick with XPath.
Nokogiri's noblanks config option doesn't remove all whitespace Text nodes when they have siblings:
describe "Nokogiri NOBLANKS parser option" do
it "doesn't remove whitespace Text nodes if they're surrounded by non-whitespace Text node siblings" do
doc = parse_xml("<root>1 <two></two> \n <three></three> \n <four></four> 5</root>")
children = doc.root.children
expect(children.size).to_not eq(5)
expect(children.size).to eq(7) #Because the two newline Text nodes are not ignored
expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
end
end
I'm not sure why Nokogiri was programmed to work that way. I think it would be better to either ignore all whitespace Text nodes or not ignore any Text nodes.