
How can I speed up XML::Twig?

I am using XML::Twig to parse through a very large XML document. I want to split it into chunks based on the <change></change> tags.

Right now I have:

my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
$xml->parsefile($LOGFILE);

sub parseChange {

  my ($xml, $change) = @_;

  my $message = $change->first_child('message');
  my @lines   = $message->children_text('line');

  foreach (@lines) {
    if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
      print outputData "$_\n";
    }
  }

  outputData->flush();
  $change->purge;
}

Right now this runs the parseChange handler each time a <change> block is pulled from the XML, and it is extremely slow. For comparison, I tried reading the file with $/ = '</change>' and a small function that returns the contents of an XML tag, and that went much faster.
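
For reference, the record-separator approach described above might look roughly like the sketch below. The helper name tag_contents and the in-memory sample data are assumptions for illustration, not the code that was actually benchmarked, and matching tags with a regex only works because this file has a flat, predictable layout.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <$tag>...</$tag> found in $chunk.
sub tag_contents {
    my ($chunk, $tag) = @_;
    return $chunk =~ m{<\Q$tag\E>(.*?)</\Q$tag\E>}gs;
}

# Read from an in-memory string here; a real run would open the log file instead.
my $data = "<change><message><line>fix a bug here</line></message></change>\n"
         . "<change><message><line>Change-Id: I1</line></message></change>\n";
open my $in, '<', \$data or die "cannot open in-memory file: $!";

my @hits;
{
    local $/ = '</change>';    # each read returns one record ending in </change>
    while (my $record = <$in>) {
        for my $message (tag_contents($record, 'message')) {
            # Same test as in parseChange: "bug" (any case) between non-alphanumerics.
            push @hits, grep { /[^a-zA-Z0-9]bug[^a-zA-Z0-9]/i }
                        tag_contents($message, 'line');
        }
    }
}
close $in;

print "$_\n" for @hits;
```

This skips XML parsing entirely, which is probably why it is faster, but it would break on attributes, nested elements of the same name, or CDATA sections.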

Is there something I'm missing or am I using XML::Twig incorrectly? I'm new to Perl.

EDIT: Here is an example change from the changes file. The file consists of a lot of these one right after the other and there should not be anything in between them:

<change>
<project>device_common</project>
<commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
<tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>      
<parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>      
<author_name>Jean-Baptiste Queru</author_name>      
<author_e-mail>[email protected]</author_e-mail>      
<author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>      
<commiter_name>Jean-Baptiste Queru</commiter_name>      
<commiter_email>[email protected]</commiter_email>      
<committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>      
<subject>chmod the output scripts</subject>      
<message>         
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>      
</message>      
<target>         
    <line>generate-blob-scripts.sh</line>      
</target>   
</change>
user1897691 asked Oct 06 '22 16:10

1 Answer

As it stands, your program is processing all of the XML document, including the data outside the change elements that you aren't interested in.

If you change the twig_handlers parameter in your constructor to twig_roots, then the tree structures will be built for only the elements of interest and the rest will be ignored.

my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });
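
A slightly fuller sketch of the same idea, assuming the change elements sit inside a single root element. The helper name find_bug_lines and the sample data are hypothetical, and the \bbug\b pattern is a simplification of the original regex: unlike the original, it also matches "bug" at the very start or end of a line and treats an underscore as part of a word.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the <line> texts under <message> that mention "bug".
sub find_bug_lines {
    my ($xml_text) = @_;
    my @hits;

    my $twig = XML::Twig->new(
        # twig_roots: subtrees are built only for <change>; the rest is skipped.
        twig_roots => {
            change => sub {
                my ($t, $change) = @_;
                my $message = $change->first_child('message');
                if ($message) {
                    push @hits, grep { /\bbug\b/i }
                                $message->children_text('line');
                }
                $t->purge;    # free the memory used by the processed <change>
            },
        },
    );
    $twig->parse($xml_text);    # use parsefile($path) for a file on disk
    return @hits;
}

my $sample = <<'XML';
<changes>
<change><message><line>Fix bug 42 in parser</line><line>Change-Id: I123</line></message></change>
<change><message><line>Debugging notes</line></message></change>
</changes>
XML

print "$_\n" for find_bug_lines($sample);
```

Collecting the matches and writing them out once, rather than flushing the output handle after every change, may also help on a very large file.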
Borodin answered Oct 10 '22 03:10