How do I get rid of unwanted nodes returned by findnodes from Perl's XML::LibXML module?

Tags:

xml

perl

libxml2

The following is just a small fraction of the XML I am working on. I want to extract all attributes, tag names, and text under the subtree.

<?xml version='1.0' encoding='UTF-8'?>
<Warehouse>
<Equipment id="ABC001" model="TV" version="3_00">
<attributes>
<Location>Chicago</Location>
<Latitude>30.970</Latitude>
<Longitude>-90.723</Longitude>
</attributes>
</Equipment></Warehouse>

I have coded an example like this:

#!/usr/bin/perl
use XML::LibXML;
use Data::Dumper;

$parser = XML::LibXML->new();
$Chunk = $parser->parse_file("numone.xml");

@Equipment = $Chunk->findnodes('//Equipment');
foreach $at ($Equipment[0]->getAttributes()) {
    ($na,$nv) = ($at -> getName(),$at -> getValue());
    print "$na => $nv\n";
}

@Equipment = $Chunk->findnodes('//Equipment/attributes');
@Attr = $Equipment[0]->childNodes;
print Dumper(@Attr);

foreach $at (@Attr) {
    ($na,$nv) = ($at->nodeName, $at->textContent);
    print "$na => $nv\n";
}

I am getting results like this:

id => ABC001
model => TV
version => 3_00
$VAR1 = bless( do{\(my $o = 10579528)}, 'XML::LibXML::Text' );
$VAR2 = bless( do{\(my $o = 13643928)}, 'XML::LibXML::Element' );
$VAR3 = bless( do{\(my $o = 13657192)}, 'XML::LibXML::Text' );
$VAR4 = bless( do{\(my $o = 13011432)}, 'XML::LibXML::Element' );
$VAR5 = bless( do{\(my $o = 10579752)}, 'XML::LibXML::Text' );
$VAR6 = bless( do{\(my $o = 10565696)}, 'XML::LibXML::Element' );
$VAR7 = bless( do{\(my $o = 13046400)}, 'XML::LibXML::Text' );
#text =>

Location => Chicago
#text =>

Latitude => 30.970
#text =>

Longitude => -90.723
#text =>

Extracting the attributes seems OK. However, extracting the tag names and text produces extra content. My questions are:

  1. Where do those ::Text elements come from?
  2. How do I get rid of those extra elements and the #text entries?

Thanks,

mkt2012 asked Dec 28 '25 16:12


2 Answers

First of all, you really should put use strict and use warnings at the start of your program, and declare every variable with my at its point of first use. This reveals many simple mistakes and is especially important in programs you are asking for help with.
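As a rough illustration of that advice, here is a minimal sketch of the attribute loop from the question written under strict and warnings. The XML is inlined as a string (trimmed from the question's sample) so the snippet runs without numone.xml, and attribute_pairs is a hypothetical helper name that is not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample data inlined for illustration; the original reads numone.xml.
my $xml = <<'XML';
<Warehouse>
<Equipment id="ABC001" model="TV" version="3_00">
<attributes>
<Location>Chicago</Location>
</attributes>
</Equipment></Warehouse>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# findnodes returns a list; take the first matching element.
my ($equipment) = $doc->findnodes('//Equipment');

# Hypothetical helper: returns [name, value] pairs for an element's attributes.
sub attribute_pairs {
    my ($element) = @_;
    return map { [ $_->nodeName, $_->value ] } $element->attributes;
}

print "$_->[0] => $_->[1]\n" for attribute_pairs($equipment);
```

With every variable declared by my, a misspelled variable name becomes a compile-time error instead of silently creating a new empty global.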

As you have been told, the XML::LibXML::Text entries are whitespace text nodes. If you want the XML::LibXML parser to ignore them, set the no_blanks option on the parser object.

Also, you would be better off using the more recent load_xml method instead of the outdated parse_file, as below:

my $parser = XML::LibXML->new(no_blanks => 1);
my $Chunk = $parser->load_xml(location => "numone.xml");

The output from this changed version of the program looks like

id => ABC001
model => TV
version => 3_00
$VAR1 = bless( do{\(my $o = 7008120)}, 'XML::LibXML::Element' );
$VAR2 = bless( do{\(my $o = 7008504)}, 'XML::LibXML::Element' );
$VAR3 = bless( do{\(my $o = 7008144)}, 'XML::LibXML::Element' );
Location => Chicago
Latitude => 30.970
Longitude => -90.723
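
To see the effect of no_blanks in isolation, the following sketch parses the question's sample twice, once with default options and once with no_blanks => 1, and compares the child nodes of the attributes element. The XML is inlined as a string so the snippet is self-contained, and child_names is a hypothetical helper introduced only for this comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample data from the question, inlined instead of read from numone.xml.
my $xml = <<'XML';
<Warehouse>
<Equipment id="ABC001" model="TV" version="3_00">
<attributes>
<Location>Chicago</Location>
<Latitude>30.970</Latitude>
<Longitude>-90.723</Longitude>
</attributes>
</Equipment></Warehouse>
XML

# Hypothetical helper: parse with the given parser options and return
# the node names of the children of the first <attributes> element.
sub child_names {
    my (%parser_options) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml, %parser_options);
    my ($attributes) = $doc->findnodes('//Equipment/attributes');
    return map { $_->nodeName } $attributes->childNodes;
}

my @default  = child_names();                 # whitespace text nodes kept
my @stripped = child_names(no_blanks => 1);   # whitespace-only nodes dropped

printf "default parser: %d child nodes (%s)\n", scalar @default,  "@default";
printf "no_blanks => 1: %d child nodes (%s)\n", scalar @stripped, "@stripped";
```

Note that no_blanks changes the parsed tree itself, so it is best suited to data-style documents like this one, where whitespace between elements carries no meaning.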
Borodin answered Dec 31 '25 11:12


The extra nodes are text nodes that contain only whitespace, e.g., the newlines between elements. Skip them if you want:

@Equipment = $Chunk->findnodes('//Equipment/attributes');
@Attr = $Equipment[0]->childNodes;
foreach $at (@Attr) {
    ($na,$nv) = ($at->nodeName, $at->textContent);

    next if $na eq "#text";  # skip text nodes between elements

    print "$na => $nv\n";
}

Output:

id => ABC001
model => TV
version => 3_00
Location => Chicago
Latitude => 30.970
Longitude => -90.723
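
Comparing the node name against "#text" works here, but the same skip can be expressed in terms of node kinds. The sketch below shows two possible variants under strict and warnings: selecting only element children with the XPath step ./*, and filtering childNodes by nodeType against the XML_ELEMENT_NODE constant. The XML is inlined for illustration, and the variable names are assumptions rather than part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);   # :libxml exports node-type constants

# Sample data from the question, inlined instead of read from numone.xml.
my $xml = <<'XML';
<Warehouse>
<Equipment id="ABC001" model="TV" version="3_00">
<attributes>
<Location>Chicago</Location>
<Latitude>30.970</Latitude>
<Longitude>-90.723</Longitude>
</attributes>
</Equipment></Warehouse>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my ($attributes) = $doc->findnodes('//Equipment/attributes');

# Variant 1: the XPath step "*" matches element children only,
# so whitespace text nodes are never returned.
my @by_xpath = map { [ $_->nodeName, $_->textContent ] }
               $attributes->findnodes('./*');

# Variant 2: keep every child whose node type is "element".
my @by_type = map  { [ $_->nodeName, $_->textContent ] }
              grep { $_->nodeType == XML_ELEMENT_NODE }
              $attributes->childNodes;

print "$_->[0] => $_->[1]\n" for @by_xpath;
```

Both variants also skip comments and other non-element nodes, which a plain "#text" name check would let through.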
Greg Bacon answered Dec 31 '25 11:12


