I want to parse an 8 GB file to find some information. This takes more than 4 hours to finish. I tried Perl's Parallel::ForkManager module for this, but it doesn't make much difference. What is a better way to implement this?
The following is the part of the code used to parse this jumbo file. I have a list of domains that I need to look up in an 8 GB zone file to find out which company each one is hosted with.
    unless(open(FH, $file)) {
        print $LOG "Can't open '$file' $!";
        die "Can't open '$file' $!";
    }
    ### Reading Zone file : $file
    DOMAIN: while(my $line = <FH> ){
        # domain and the dns with whom he currently hosted
        my($domain, undef, $new_host) = split(/\s|\t/, $line);
        next if $seen{$domain};
        $seen{$domain} = 1;
        $domain .= ".$domain_type";
        $domain = lc($domain);
        # already in?
        if($moved_domains->{$domain}){
            # Get the next domain if this on the same host, there is nothing to record
            if($new_host eq $moved_domains->{$domain}->{PointingHost}){
                next DOMAIN;
            }
            # movedout
            else{
                @INSERTS = ($domain, $data_date, $new_host, $moved_domains->{$domain}->{Host});
                log_this($data_date, $populate, @INSERTS);
            }
            delete $moved_domains->{$domain};
        }
        # new to MovedDomain
        else{
            # is this any of our interested HOSTS
            my ($interested) = grep{$new_host =~ /\b$_\b/i} keys %HOST;
            # if not any of our interested DNS, NEXT!
            next DOMAIN if not $interested;
            @INSERTS = ($domain, $data_date, $new_host, $HOST{$interested});
            log_this($data_date, $populate, @INSERTS);
        }
        next DOMAIN;
    }
A basic line-by-line parsing pass through a 1GB file -- for example, running a regex or something -- takes just a couple of minutes on my 5-year-old Windows box. Even if the parsing work is more extensive, 4 hours sounds like an awfully long time for 8GB of data.
Are you sure that your code does not have a glaring inefficiency? Are you storing a lot of information during the parsing and bumping up against your RAM limits? CPAN has tools that will allow you to profile your code, notably Devel::NYTProf.
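One candidate worth checking with the profiler is the `grep` over `keys %HOST` in the question's loop: it interpolates every key into a fresh pattern for each line of the file. The following is a minimal sketch, not a confirmed fix. It assumes `%HOST` maps host-name fragments to company names (the sample entries are made up), that the keys are meant to match literally, and that this lookup really is a hot spot, which only profiling can confirm. The helper name `interested_host` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real %HOST comes from the question's script.
my %HOST = (
    'examplehost' => 'Example Hosting',
    'otherdns'    => 'Other DNS Inc',
);

# Map lowercased keys back to the original keys, because the match below is
# case-insensitive and captures the text as it appears in the input line.
my %key_by_lc = map { lc($_) => $_ } keys %HOST;

# Build one alternation once, outside the read loop. Longer keys go first so
# they win over shorter keys that share a prefix. quotemeta treats each key
# as literal text (an assumption about what the keys contain).
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %HOST;
my $host_re = qr/\b($alternation)\b/i;

# Returns the matching %HOST key, or undef if no key matches.
sub interested_host {
    my ($new_host) = @_;
    return $new_host =~ $host_re ? $key_by_lc{ lc $1 } : undef;
}
```

Inside the loop, `my $interested = interested_host($new_host);` would stand in for the `grep` line. One difference to keep in mind: the original returns the first matching key in hash order, while this returns the leftmost match in the host string, so results can differ when several keys match the same line.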
Before going through the hassle of parallelizing your code, make sure that you understand where the bottleneck is. If you explain what you are doing or, even better, provide code that illustrates the problem in a compact way, you might get better answers.
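A rough way to see whether the time goes into reading the file or into the per-line work is to time a bare pass and compare it with a pass that does some of the processing. This is only a sketch: `time_pass` is a hypothetical helper, and the file path is taken from the command line. Time::HiRes is a core module.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Read $path line by line, calling $per_line->($line) for each line.
# Returns (number of lines, elapsed wall-clock seconds).
sub time_pass {
    my ($path, $per_line) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    my $start = time();
    my $lines = 0;
    while (my $line = <$fh>) {
        $per_line->($line);
        $lines++;
    }
    close $fh;
    return ($lines, time() - $start);
}

# Usage: pass the zone file path as the first command-line argument.
if (@ARGV) {
    # First pass: reading only, no per-line work.
    my ($n, $read_only) = time_pass($ARGV[0], sub { });
    # Second pass: reading plus a whitespace split of every line.
    my (undef, $with_split) = time_pass($ARGV[0], sub { my @f = split ' ', $_[0] });
    printf "%d lines: read only %.2fs, read + split %.2fs\n",
        $n, $read_only, $with_split;
}
```

The second pass may benefit from the operating system's file cache, so the comparison is only indicative. If the bare read already takes hours, the disk is the limit and forking more workers is unlikely to help. If it takes minutes, the per-line logic is where to look.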