
Perl XML::LibXML: how to access comment nodes

For the life of me, I can't figure out the proper code to access the comment lines in my XML file. Do I use findnodes, find, or getElementsByTagName (I doubt it)?

Am I even making the correct assumption that these comment lines are accessible? I would hope so, as I know I can add a comment.

The type number for a comment node is 8, so they must be parseable.

Ultimately, what I want to do is delete them.

my @nodes = $dom->findnodes("//*");

foreach my $node (@nodes) {
  print $node->nodeType, "\n";
}

<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT> 
CraigP asked Oct 17 '13 16:10


3 Answers

  • If all you need to do is produce a copy of the XML with comment nodes removed, then the first parameter of toStringC14N is a flag that says whether you want comments in the output. Omitting all parameters implicitly sets the first to a false value, so

    $doc->toStringC14N
    

will reproduce the XML trimmed of comments. Note that the Canonical XML form specified by C14N doesn't include an XML declaration header. It is always XML 1.0 encoded in UTF-8.

  • If you need to remove the comments from the in-memory structure of the document before processing it further, then findnodes with the XPath expression //comment() will locate them for you, and unbindNode will remove them from the XML.

This program demonstrates both approaches.

use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<END_XML);
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>
END_XML

# Print everything
print $doc->toString, "\n";

# Print without comments
print $doc->toStringC14N, "\n\n";

# Remove comments and print everything
$_->unbindNode for $doc->findnodes('//comment()');
print $doc->toString;

output

<?xml version="1.0"?>
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>

<TT>
 <A>xyz</A>

</TT>

<?xml version="1.0"?>
<TT>
 <A>xyz</A>

</TT>
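As a small follow-up to the program above, this sketch shows the effect of the first parameter of toStringC14N: a false value drops comments and a true value keeps them. The helper name canonical_xml is hypothetical and exists only for this illustration; it is not part of XML::LibXML.

```perl
use strict;
use warnings;

use XML::LibXML;

# Return the canonical (C14N) form of an XML string.
# canonical_xml() is a hypothetical helper, not an XML::LibXML method.
sub canonical_xml {
    my ($xml, $keep_comments) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);

    # First parameter of toStringC14N: true keeps comments, false drops them
    return $doc->toStringC14N($keep_comments ? 1 : 0);
}

my $xml = "<TT>\n <A>xyz</A>\n <!-- my comment -->\n</TT>\n";

print canonical_xml($xml, 0), "\n\n";   # comments dropped
print canonical_xml($xml, 1), "\n";     # comments kept
```

Neither form carries an XML declaration, so add one yourself if the consumer of the output expects it.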



Update

To select a specific comment, you can add a predicate expression to the XPath selector. To find the specific comment in your example data you could write

$doc->findnodes('//comment()[. = " my comment "]')

Note that the text of the comment includes everything between the opening <!-- and the closing -->, so spaces are significant, as shown in that call.

If you want to make things a bit more lax, you could use normalize-space, which removes leading and trailing whitespace, and contracts every sequence of whitespace within the string to a single space. Now you can write

$doc->findnodes('//comment()[normalize-space(.) = "my comment"]')

And the same call would find your comment even if it looked like this.

<!--
my
comment
-->

Finally, you can make use of contains, which, as you would expect, simply checks whether one string contains another. Using that you could write

$doc->findnodes('//comment()[contains(., "comm")]')

The one to choose depends on your requirement and your situation.
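To compare the three predicates side by side, here is a minimal sketch that counts how many comments each expression selects. The sample document (with two extra comments) and the helper name count_matches are assumptions made for this illustration.

```perl
use strict;
use warnings;

use XML::LibXML;

# Sample data: the original comment, a multi-line variant, and an unrelated one
my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<TT>
 <A>xyz</A>
 <!-- my comment -->
 <!--
 my
 comment
 -->
 <!-- something else -->
</TT>
END_XML

# Count the nodes selected by an XPath expression.
# count_matches() is a hypothetical helper for this illustration.
sub count_matches {
    my ($doc, $xpath) = @_;
    my @nodes = $doc->findnodes($xpath);
    return scalar @nodes;
}

for my $xpath (
    '//comment()',
    '//comment()[. = " my comment "]',
    '//comment()[normalize-space(.) = "my comment"]',
    '//comment()[contains(., "comm")]',
) {
    printf "%-50s %d\n", $xpath, count_matches($doc, $xpath);
}
```

The exact-match predicate selects only the single-line comment, while normalize-space and contains also pick up the multi-line one.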

Borodin answered Oct 15 '22 07:10


According to the XPath spec:

  • * is a test that matches element nodes of any name. Comment nodes aren't element nodes.

  • comment() is a test that matches comment nodes.

Untested:

for my $comment_node ($doc->findnodes('//comment()')) {
   $comment_node->parentNode->removeChild($comment_node);
}
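To see why the loop in the question never printed 8, this sketch lists the node types each expression selects from the question's sample data. The helper name node_types is hypothetical; the node-type constants come from XML::LibXML's :libxml export tag.

```perl
use strict;
use warnings;

use XML::LibXML qw(:libxml);   # imports node-type constants such as XML_COMMENT_NODE

my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>
END_XML

# Return the node types selected by an XPath expression.
# node_types() is a hypothetical helper for this illustration.
sub node_types {
    my ($doc, $xpath) = @_;
    return map { $_->nodeType } $doc->findnodes($xpath);
}

my @element_types = node_types($doc, '//*');          # element nodes only
my @comment_types = node_types($doc, '//comment()');  # comment nodes only

print "//* types:         @element_types\n";
print "//comment() types: @comment_types\n";
print "XML_COMMENT_NODE:  ", XML_COMMENT_NODE, "\n";
```

Comparing nodeType against XML_COMMENT_NODE rather than the literal 8 keeps the intent readable if you filter nodes by type yourself.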
ikegami answered Oct 15 '22 06:10


I know it's not XML::LibXML, but here is another way to remove comments easily, using the XML::Twig module:

#!/usr/bin/env perl

use warnings;
use strict;
use XML::Twig;

my $twig = XML::Twig->new(
    pretty_print => 'indented',
    comments => 'drop'
)->parsefile( shift )->print;

Run it like:

perl script.pl xmlfile

That yields:

<TT>
  <A>xyz</A>
</TT>

The comments option also accepts the value process, which keeps the comments in the twig so that you can work with them under the name #COMMENT.

Birei answered Oct 15 '22 07:10