
Parallel computing in Perl

Tags:

perl

I want to parse through an 8 GB file to find some information. This is taking more than 4 hours to finish. I have tried the Perl Parallel::ForkManager module for this, but it doesn't make much difference. What is a better way to implement this?

The following is the part of the code used to parse this jumbo file. I have a list of domains that I have to look up in an 8 GB zone file to find out which company each one is hosted with.

    unless(open(FH, $file)) {
        print $LOG "Can't open '$file'  $!";
        die "Can't open '$file'  $!";
    }

    ### Reading Zone file : $file
    DOMAIN: while(my $line = <FH> ){

        # domain and the DNS host it is currently hosted with
        my($domain, undef, $new_host) = split(/\s|\t/, $line);
        next if $seen{$domain};
        $seen{$domain} =1;

        $domain.=".$domain_type";
        $domain = lc ($domain);


        #already in?
        if($moved_domains->{$domain}){

            # Get the next domain if this is on the same host; there is nothing to record
            if($new_host eq $moved_domains->{$domain}->{PointingHost}){
                next DOMAIN;
            }
            #movedout
            else{
                @INSERTS = ($domain, $data_date, $new_host, $moved_domains->{$domain}->{Host});
                log_this($data_date, $populate, @INSERTS);
            }
            delete $moved_domains->{$domain};
        }
        #new to MovedDomain
        else{
            # is this one of the hosts we are interested in?
            my ($interested) = grep{$new_host =~/\b$_\b/i} keys %HOST;

            #if not any of our interested DNS, NEXT!
            next DOMAIN if not $interested;
            @INSERTS = ($domain, $data_date, $new_host, $HOST{$interested});
            log_this($data_date, $populate, @INSERTS);

        }
        next DOMAIN;

    }
asked Jan 25 '26 by arshad


1 Answer

A basic line-by-line parsing pass through a 1 GB file (running a regex against each line, for example) takes just a couple of minutes on my 5-year-old Windows box. Even if the parsing work is more extensive, 4 hours sounds like an awfully long time for 8 GB of data.
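
As a sanity check, it may be worth timing a bare read-and-match pass over your file, separate from your real logic. Something like the sketch below (the file name and pattern are placeholders) gives you a baseline for raw I/O plus regex cost on your hardware; if even this is slow, the disk is the limit and parallel parsing won't help much.

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $file  = 'zonefile.txt';              # placeholder path
    my $start = time;
    my $hits  = 0;

    open my $fh, '<', $file or die "Can't open '$file': $!";
    while (my $line = <$fh>) {
        $hits++ if $line =~ /\bNS\b/;        # placeholder pattern
    }
    close $fh;

    printf "Matched %d lines in %.1f seconds\n", $hits, time() - $start;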

Are you sure that your code does not have a glaring inefficiency? Are you storing a lot of information during the parsing and bumping up against your RAM limits? CPAN has tools that will allow you to profile your code, notably Devel::NYTProf.
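
Typical Devel::NYTProf usage is to run the script once under the profiler and then generate an HTML report from the collected data (the script name here is a placeholder):

    # run once under the profiler; writes nytprof.out in the current directory
    perl -d:NYTProf parse_zone.pl

    # turn nytprof.out into an HTML report under ./nytprof/
    nytprofhtml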

Before going through the hassle of parallelizing your code, make sure that you understand where the bottleneck is. If you explain what you are doing or, even better, provide code that illustrates the problem in a compact way, you might get better answers.
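
If profiling does show that the per-line work (rather than disk I/O) is the bottleneck, the usual Parallel::ForkManager pattern is to split the file into chunks up front and fork one worker per chunk, roughly as sketched below. The chunk file names and process_chunk are placeholders; each child would run the same per-line logic you already have, and note that state like your %seen hash would then no longer be shared between chunks.

    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Chunk files produced beforehand, e.g. with GNU split: split -n l/4 zonefile.txt chunk_
    my @chunks = glob('chunk_*');            # placeholder file names

    my $pm = Parallel::ForkManager->new(4);  # max number of worker processes

    for my $chunk (@chunks) {
        $pm->start and next;                 # parent: fork a child, move to the next chunk
        process_chunk($chunk);               # child: parse its own chunk
        $pm->finish;                         # child exits
    }
    $pm->wait_all_children;

    sub process_chunk {
        my ($chunk) = @_;
        open my $fh, '<', $chunk or die "Can't open '$chunk': $!";
        while (my $line = <$fh>) {
            # ... same per-line logic as in the question ...
        }
        close $fh;
    }

But none of that is worth doing until the profiler tells you where the time actually goes.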

answered Jan 28 '26 by FMc