I am using XML::Twig to parse a very large XML document. I want to split it into chunks based on the <change></change> tags.
Right now I have:
    my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
    $xml->parsefile($LOGFILE);

    sub parseChange {
        my ($xml, $change) = @_;
        my $message = $change->first_child('message');
        my @lines = $message->children_text('line');
        foreach (@lines) {
            if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
                print outputData "$_\n";
            }
        }
        outputData->flush();
        $change->purge;
    }
At the moment this runs the parseChange handler each time a <change> block is pulled from the XML, and it is extremely slow. For comparison, I tested reading the file directly with $/ set to "</change>" and a hand-written function that returns the contents of an XML tag, and that went much faster.
Is there something I'm missing, or am I using XML::Twig incorrectly? I'm new to Perl.
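For reference, the record-separator comparison described above might look roughly like the sketch below. This is a reconstruction under assumptions, not the code that was actually timed: the helper names tag_contents and bug_lines are hypothetical, the sample data is made up, and the regex-based extraction ignores attributes, CDATA and nested tags of the same name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <$tag>...</$tag> pair in $chunk.
# A regex sketch only; it is not a real XML parser.
sub tag_contents {
    my ($tag, $chunk) = @_;
    return $chunk =~ m{<\Q$tag\E>(.*?)</\Q$tag\E>}gs;
}

# Read one <change> record at a time from $fh and return the message lines
# that contain "bug" with a non-alphanumeric character on each side.
sub bug_lines {
    my ($fh) = @_;
    local $/ = '</change>';    # each read now ends at a closing change tag
    my @hits;
    while (my $record = <$fh>) {
        for my $message (tag_contents('message', $record)) {
            push @hits, grep { /[^a-zA-Z0-9]bug[^a-zA-Z0-9]/i }
                        tag_contents('line', $message);
        }
    }
    return @hits;
}

# Demo on an in-memory file handle with made-up sample data.
my $sample = <<'XML';
<change>
<message>
<line>Fix bug 1234 in parser</line>
<line>Change-Id: Iae22</line>
</message>
</change>
<change>
<message>
<line>Debugging output removed</line>
</message>
</change>
XML
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for bug_lines($fh);
```

Only one record is held in memory at a time, which is why this approach is fast, but it depends on the file having exactly the flat layout shown and would break on anything a real XML parser handles.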
EDIT: Here is an example change from the changes file. The file consists of a lot of these one right after the other and there should not be anything in between them:
    <change>
      <project>device_common</project>
      <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
      <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
      <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
      <author_name>Jean-Baptiste Queru</author_name>
      <author_e-mail>[email protected]</author_e-mail>
      <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
      <commiter_name>Jean-Baptiste Queru</commiter_name>
      <commiter_email>[email protected]</commiter_email>
      <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
      <subject>chmod the output scripts</subject>
      <message>
        <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
      </message>
      <target>
        <line>generate-blob-scripts.sh</line>
      </target>
    </change>
As it stands, your program is processing all of the XML document, including the data outside the change elements that you aren't interested in.
If you change the twig_handlers parameter in your constructor to twig_roots, then tree structures will be built only for the elements of interest and the rest will be ignored.

    my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });
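Put together with the handler from the question, a complete run could look something like the sketch below. It is only an illustration under a few assumptions: the sample data and the <changes> root element are made up, matching lines are collected in an array rather than printed to the outputData handle, and a guard is added for changes that have no message child.

```perl
use strict;
use warnings;
use XML::Twig;

my @bug_lines;    # collected here instead of printing to outputData

sub parseChange {
    my ($twig, $change) = @_;
    my $message = $change->first_child('message');
    if ($message) {    # skip changes that have no <message> child
        push @bug_lines, grep { /[^a-zA-Z0-9]bug[^a-zA-Z0-9]/i }
                         $message->children_text('line');
    }
    $twig->purge;      # release everything parsed so far
}

# twig_roots: build trees only for <change> elements, ignore the rest.
my $twig = XML::Twig->new(twig_roots => { change => \&parseChange });

# Made-up sample input; <changes> is a hypothetical root element.
$twig->parse(<<'XML');
<changes>
  <change>
    <subject>first</subject>
    <message>
      <line>Fix bug 1234 in parser</line>
      <line>Change-Id: Iae22</line>
    </message>
  </change>
  <change>
    <subject>second</subject>
    <message>
      <line>Debugging output removed</line>
    </message>
  </change>
</changes>
XML

print "$_\n" for @bug_lines;
```

For the real log you would call $twig->parsefile($LOGFILE) instead of parse, as in the question.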