
Perl, XML::Twig: how to read fields with the same tag

Tags:

xml

perl

xml-twig

I'm processing an XML file that I receive from a partner. I have no influence over the structure of this XML file. An extract of the XML is:

<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <description>
    <![CDATA[<p>some descrioption of the house</p>]]> </description>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
        <description>
          <![CDATA[Some text again, different than the first tag]]>
        </description>
      </localcost>
    </localcosts>
  </object>
</objects>

The reason I use XML::Twig is that this XML file is about 11 GB in size (about 100,000 different objects). The problem is that when I reach the localcosts part, the three fields (type, name and description) are skipped, probably because these names have already been used earlier.

The code I use to go through the xml file is as follows:

my $twig= new XML::Twig( twig_handlers => { 
                 id                            => \&get_ID,
                 name                          => \&get_Name,
                 type                          => \&get_Type,
                 description                   => \&get_Description,
                 localcosts                    => \&get_Localcosts
});

$lokaal="c:\\temp\\data3.xml";
getstore($xml, $lokaal);
$twig->parsefile("$lokaal");

sub get_ID          { my( $twig, $data)= @_;  $field[0]=$data->text; $twig->purge; } 
sub get_Name        { my( $twig, $data)= @_;  $field[1]=$data->text; $twig->purge; }
sub get_Type        { my( $twig, $data)= @_;  $field[3]=$data->text; $twig->purge; }
sub get_Description { my( $twig, $data)= @_;  $field[8]=$data->text; $twig->purge; }
sub get_Localcosts{

  my ($t, $item) = @_;

  my @localcosts = $item->children;
  for my $localcost ( @localcosts ) {
    print "$field[0]: $localcost->text\n";
    my @costs = $localcost->children;
    for my $cost (@costs) {
      $Type       =$cost->text if $cost->name eq q{type};
      $Name       =$cost->text if $cost->name eq q{name};
      $Description=$cost->text if $cost->name eq q{description};
      print "Fields: $Type, $Name, $Description\n";
    }
  }
  $t->purge;    
}

When I run this code, the main fields are read without issues, but when the code arrives at the 'localcosts' part, the inner for loop is not executed. When I change the field names in the XML to unique ones, this code works perfectly.

Can someone help me out?

Thanks

user2970543 asked Jun 08 '14 14:06


3 Answers

If you want the handlers for type, name and description to be triggered only for direct children of the object element, specify the path:

my $twig = new XML::Twig( twig_handlers => { 
                 id                    => \&get_ID,
                 'object/name'         => \&get_Name,
                 'object/type'         => \&get_Type,
                 'object/description'  => \&get_Description,
                 localcosts            => \&get_Localcosts
    });

choroba answered Oct 02 '22 03:10


The problem is that the name, type and description handlers are executed for both occurrences of those elements: once inside object and once inside localcost. You will find that the contents of @field come from the localcost values, because the data from the object values has been overwritten.

Also, while handling the children of the localcost elements, the handlers have called $twig->purge, which removes the data from memory. So when the localcosts handler is called, it finds the element empty.
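
The effect of purging in leaf handlers can be seen with a small sketch. The inline XML is a cut-down stand-in for the partner's file, and the $remaining variable is introduced here only for illustration; the handlers follow the same pattern as in the question.

```perl
use strict;
use warnings;
use XML::Twig;

# Cut-down stand-in for the partner's XML (an assumption for illustration).
my $xml = <<'XML';
<objects><object><localcosts><localcost>
  <type>mandatory</type>
  <name>What kind of cost</name>
</localcost></localcosts></object></objects>
XML

my $remaining;    # number of element children still attached to <localcost>

my $twig = XML::Twig->new(
    twig_handlers => {
        # Same pattern as in the question: every leaf handler purges.
        type => sub { $_[0]->purge },
        name => sub { $_[0]->purge },

        # Called at the end of <localcost>, after its children were purged.
        localcost => sub {
            my ($t, $localcost) = @_;
            my @left = $localcost->children('#ELT');
            $remaining = @left;
        },
    },
);

$twig->parse($xml);
print "element children left in localcost: $remaining\n";
```

If the two purge calls are removed, the localcost handler should see both children again.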

I think the easiest way to do this is to write a single handler that processes each object node in one go and then purges it.

The program below demonstrates this. Note that I have used Data::Dumper only so that you can see the contents of @fields once it has been populated.

It is very important that you use strict and use warnings at the top of every Perl program, especially if you are asking for help with it. It is a simple measure that can reveal many straightforward errors that you may otherwise waste a lot of time searching for.

Note also that the "indirect object" form of method calls is discouraged: you should write XML::Twig->new(...) instead of new XML::Twig (...).

And if you use single quotes instead of double quotes, a backslash inside a string doesn't need to be doubled up unless it is the last character of the string. But Perl is quite happy if you use forward slashes as a path separator, even on Windows.
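
A quick way to check the quoting rules for yourself is a sketch like the following. The path is the one from the question; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Double quotes: each backslash has to be escaped.
my $double = "c:\\temp\\data3.xml";

# Single quotes: a backslash is literal unless it precedes
# another backslash or the closing quote.
my $single = 'c:\temp\data3.xml';

# Forward slashes need no escaping in either quoting style.
my $forward = 'c:/temp/data3.xml';

print "both quoting styles give the same string\n" if $double eq $single;
print "$forward\n";
```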

I hope this helps.

use strict;
use warnings;

use XML::Twig;
use Data::Dumper;
$Data::Dumper::Useqq = 1;

my $twig= XML::Twig->new( twig_handlers => { object => \&get_Object });

my $lokaal = 'c:\temp\data3.xml';

my @fields;
$twig->parsefile($lokaal);


sub get_Object {

  my ($twig, $object) = @_;

  $fields[0] = $object->findvalue('id');
  $fields[1] = $object->findvalue('name');
  $fields[3] = $object->findvalue('type');
  $fields[8] = $object->findvalue('description');

  print Dumper \@fields;

  my @localcosts = $object->findnodes('localcosts/localcost');

  for my $localcost (@localcosts) {

    my $type        = $localcost->findvalue('type');
    my $name        = $localcost->findvalue('name');
    my $description = $localcost->findvalue('description');

    print "$type, $name, $description\n";
  }

  $twig->purge;    
}

output

$VAR1 = [
          "VW-XJC9",
          "Name",
          undef,
          "House",
          undef,
          undef,
          undef,
          undef,
          "<p>some descrioption of the house</p> "
        ];
mandatory, What kind of cost, Some text again, different than the first tag

Borodin answered Oct 04 '22 03:10


As Borodin said, if you have handlers on name, type and description, and you call $twig->purge at the end of each handler, then the elements are removed from the tree before the localcosts handler can see them. You could instead set a handler on object that only calls $twig->purge, remove the purge calls from the other handlers, and you would be OK.

You don't need to call purge "too often"; just make sure you call it at a low enough level that you don't use too much memory. There is really no point in calling it for every single leaf element.

That's a common mistake, one that I make myself quite often ;--(.
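
A minimal sketch of that arrangement, assuming a small inline sample in place of the real file: the leaf handlers only record values, and the object handler is the only place that purges. The %object, @costs and @seen variables are hypothetical names introduced for this example.

```perl
use strict;
use warnings;
use XML::Twig;

# Small inline sample standing in for the partner's file (an assumption).
my $xml = <<'XML';
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
      </localcost>
    </localcosts>
  </object>
</objects>
XML

my (%object, @costs, @seen);

my $twig = XML::Twig->new(
    twig_handlers => {
        # Leaf handlers only record values; none of them purges.
        'object/id'   => sub { $object{id}   = $_->text },
        'object/name' => sub { $object{name} = $_->text },
        'object/type' => sub { $object{type} = $_->text },

        # The children of localcost are still in the tree here.
        'localcost' => sub {
            push @costs, { type => $_->field('type'), name => $_->field('name') };
        },

        # Fires last for each object, once it has been parsed completely.
        'object' => sub {
            my ($t) = @_;
            push @seen, { %object, costs => [@costs] };
            %object = ();
            @costs  = ();
            $t->purge;    # the only purge: frees the object just handled
        },
    },
);

$twig->parse($xml);

for my $obj (@seen) {
    print "$obj->{id}: $obj->{name} ($obj->{type})\n";
    print "  cost: $_->{type}, $_->{name}\n" for @{ $obj->{costs} };
}
```

For the real 11 GB file you would process each object inside the object handler instead of collecting everything in @seen, so that memory use stays bounded.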

mirod answered Oct 03 '22 03:10