I'm trying to parse a large XML file. I read it using XML::SAX (with the Expat backend, not the pure-Perl implementation) and put every node at the second level and below into my "Node" class:
package Node;
use Moose;

has "name" =>
(
    isa    => "Str",
    reader => 'getName',
);

has "text" =>
(
    is  => "rw",
    isa => "Str",
);

has "attrs" =>
(
    is  => "rw",
    isa => "HashRef[Str]",
);

has "subNodes" =>
(
    is      => "rw",
    isa     => "ArrayRef[Node]",
    default => sub { [] },
);

# Return the single child node with the given name, or undef if there
# are zero (or more than one) matches.
sub subNode
{
    my ($self, $name) = @_;
    my $subNodeRef = $self->subNodes;
    my @matchingSubnodes = grep { $_->getName eq $name } @$subNodeRef;
    if (scalar(@matchingSubnodes) == 1)
    {
        return $matchingSubnodes[0];
    }
    return undef;
}

1;
In the "end_element" sub, I check if this is a node I care about, and if it is, I do some further processing.
This all worked fine on my test files, but the day before yesterday I threw it at my real file, all 13 million lines of it, and it's taking forever. It's been running for over 36 hours. How do I tell if it's Moose or XML::SAX that's the bottleneck? Is Moose always this slow, or am I using it wrong?
Update: Profiling a 20,000-line subset of the data shows that Moose is indeed the bottleneck - specifically Class::MOP::Class::compute_all_applicable_attributes (13.9%) and other Class::MOP and Moose internals.
Moose does quite a bit of work at startup, which can make it appear a little slow, but the code it generates, especially things like attribute accessors, is generally quite a bit faster than what the average Perl programmer would write by hand. Given that your process runs for a long time anyway, I doubt any overhead introduced by Moose is relevant.
However, from the code you've shown I can't really tell what your bottleneck is, even though I firmly believe it isn't Moose. I also want to point out that calling __PACKAGE__->meta->make_immutable to state that your class is now "finalised" allows Moose to do some further optimisations, but again, I doubt this is what's causing you trouble.
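That just means adding one line at the bottom of the class, with the rest of the package unchanged:

package Node;
use Moose;

# ... attribute and method declarations as above ...

# Tell Moose the class definition is final so it can inline constructors
# and accessors.
__PACKAGE__->meta->make_immutable;

1;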
How about you take a smaller sample of your data, so your program finishes in a reasonable time, and look at it with a profiler such as Devel::NYTProf. That will tell you exactly where the time in your program is spent, so you can optimise those parts specifically for the greatest possible gain.
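For example (the script and sample-file names here are just placeholders; adjust to taste):

# run the script under the profiler on the reduced data set
perl -d:NYTProf parse.pl small-sample.xml

# turn the collected nytprof.out into a browsable HTML report
nytprofhtml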
One possibility is that the type constraints you're using slow things down quite a bit. Validating instance attributes that thoroughly on every single write access (and on class instantiation) isn't something most programmers would usually do. You could try using simpler constraints, such as ArrayRef instead of ArrayRef[Node], if you're confident enough in the validity of your data. That way only the type of the attribute value itself is checked, not the class of every element in that array reference.
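In your Node class that would mean declarations along these lines:

# Only check that the values are plain references; skip per-element checks.
has "attrs" =>
(
    is  => "rw",
    isa => "HashRef",
);

has "subNodes" =>
(
    is      => "rw",
    isa     => "ArrayRef",
    default => sub { [] },
);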
But still, profile your code. Don't guess.
I strongly suspect that your speed problem is not in Moose so much as in memory allocation and disk swapping. Even without ->meta->make_immutable, based on your timing for the 20K subset (about 11 seconds), your script should finish in roughly 2 hours: (11 s * (13_000_000 / 20_000)) / 60 ≈ 119 min. With ->meta->make_immutable it would come down to roughly 65 min.
Try running your big script again and watch what your memory and swap are doing; I suspect you're giving your disk an awful thrashing.
I have successfully written large XML processing apps using XML::Twig; a 745 MB file takes less than an hour to process on a reasonably sized box.
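The usual pattern is to install handlers for the elements you care about and purge what has already been processed, roughly like this (the 'record' element name is just a placeholder for whatever your second-level elements are called):

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        record => \&process_record,   # fires once per <record> element
    },
);
$twig->parsefile('big.xml');

sub process_record
{
    my ($twig, $elem) = @_;
    # ... extract whatever you need from $elem here ...
    $twig->purge;   # release everything parsed so far to keep memory flat
}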
But as other users have already mentioned, you need to profile your code to figure out what exactly is causing the issue.