
Purge XML Twig inside sub handler

Tags: xml, perl, xml-twig

I am parsing large XML files (60GB+) with XML::Twig in an OO (Moose) script. I use the twig_handlers option to process elements as soon as they're read into memory. However, I'm not sure how to handle both the element and the full twig inside the handler.

Before I used Moose (and OO altogether), my script looked as follows (and worked):

my $twig = XML::Twig->new(
  twig_handlers => {
    $outer_tag => \&_process_tree,
  }
);
$twig->parsefile($input_file);


sub _process_tree {
  my ($fulltwig, $twig) = @_;

  $twig->cut;
  $fulltwig->purge;
  # Do stuff with twig
}

With Moose, I now do it like this:

my $twig = XML::Twig->new(
  twig_handlers => {
    $self->outer_tag => sub {
      $self->_process_tree($_);
    }
  }
);
$twig->parsefile($self->input_file);

sub _process_tree {
  my ($self, $twig) = @_;

  $twig->cut;
  # Do stuff with twig
  # But now the 'full twig' is not purged
}

The problem is that the full twig is no longer purged. In the first, non-OO version, I assumed that purging would help save memory by getting rid of the full twig as soon as possible. However, in the OO version (which has to rely on an explicit sub {} inside the handler) I don't see how I can purge the full twig, because the documentation says:

$_ is also set to the element, so it is easy to write inline handlers like

para => sub { $_->set_tag( 'p'); }

So the documentation mentions the element you want to process, but not the full twig itself. How can I purge the full twig if it is not passed to the subroutine?

Asked by Bram Vanroy, Jul 23 '17 09:07


1 Answer

The handler still receives the full twig; you're just not using it (you use $_ instead).

As it turns out, you can still call purge on what you call the twig (which I usually call the "element", or elt in the docs): $_->purge will work as expected, purging the full twig up to the current element in $_.
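
Here is a minimal sketch of that first option, not a drop-in solution. The class name TwigDemo, its run method, the seen attribute and the sample XML are all made up for illustration, and a plain blessed hash stands in for your Moose class. It leaves out the cut call and purges only after the element has been processed:

```perl
use strict;
use warnings;
use XML::Twig;

# TwigDemo is a hypothetical stand-in for the Moose class in the question.
package TwigDemo {
    sub new {
        my ($class, %args) = @_;
        return bless { outer_tag => $args{outer_tag}, seen => [] }, $class;
    }

    sub outer_tag { return $_[0]{outer_tag} }

    sub run {
        my ($self, $xml) = @_;
        my $tag  = $self->outer_tag;
        my $twig = XML::Twig->new(
            twig_handlers => {
                # $_ is the element that has just been parsed
                $tag => sub { $self->_process_tree($_) },
            },
        );
        $twig->parse($xml);    # use parsefile($path) for a real file
        return $self->{seen};
    }

    sub _process_tree {
        my ($self, $elt) = @_;
        push @{ $self->{seen} }, $elt->text;    # do stuff with the element
        $elt->purge;    # purge through the element; the full twig is not needed here
    }
}

my $seen = TwigDemo->new(outer_tag => 'item')
                   ->run('<root><item>a</item><item>b</item><item>c</item></root>');
print join(',', @$seen), "\n";    # prints "a,b,c"
```

The handler closure captures $self, so the method only needs the element it is given.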

A cleaner (IMHO) way would be to actually get all of the parameters and purge the full twig explicitly:

my $twig = XML::Twig->new(
  twig_handlers => {
    $self->outer_tag => sub {
      $self->_process_tree(@_); # pass _all_ of the arguments
    }
  }
);
$twig->parsefile($self->input_file);

sub _process_tree {
  my ($self, $full_twig, $twig) = @_; # now you see them!

  $twig->cut;
  # Do stuff with twig
  $full_twig->purge;  # now you don't
}
Answered by mirod, Nov 05 '22 20:11