Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize Perl Script to Correlate Records Between Two Files

Tags:

perl

open( FH, 'MAH' ) or die "$!";
while ( $lines = <FH> ) {
    $SSA = substr( $lines, 194, 9 );
    open( FH1, 'MAH2' ) or die "$!";
    while ( $array1 = <FH1> ) {
        @fieldnames = split( /\|/, $array1 );
        $SSA1       = $fieldnames[1];
        $report4    = $fieldnames[0];
        if ( $SSA =~ /$SSA1/ ) {
            $report5= $report4;
        }
    }
}

I am trying to extract the "SSA" value from MAH file and search in MAH2 file for that value. If found, return the "report4" value. I am able get the output but it is taking lot of time to process. Is there any way to optimize the code so it finishes quickly?

Each of my files has 300,000 records and file sizes are 15 MB. Currently it is taking 5 hours to process

like image 815
Akhil Avatar asked Mar 05 '21 14:03

Akhil


2 Answers

Build a lookup table.

my $foo_qfn = 'MAH';
my $bar_qfn = 'MAH2';

my %foos;
{
   open(my $fh, '<', $foo_qfn)
      or die("Can't open \"$foo_qfn\": $!\n");

   while ( my $foo_line = <$fh> ) {
      my $ssa = substr($foo_line, 194, 9);
      $foos{$ssa} = $foo_line;
   }
}

{
   open(my $fh, '<', $bar_qfn)
      or die("Can't open \"$bar_qfn\": $!\n");

   while ( my $bar_line = <$fh> ) {
      chomp($bar_line);
      my ($report4, $ssa) = split(/\|/, $bar_line);
      my $foo_line = $foos{$ssa};
      ...
   }
}

Your original code took time indirectly proportional to the number of foos times the number of bars (O(N*M)).

This will take time indirectly proportional to the largest of the number of foos and the number of bars (O(N+M)).

In other words, this should be over 100,000 times faster. We're talking seconds, not hours.

like image 114
ikegami Avatar answered Oct 16 '22 22:10

ikegami


If your task is just to find the records in file2 that correspond to records in file1 by the SSA field, there is another way to do it that might be faster and simpler than the classic lookup hash table approach.

You can use a regular expression constructed from the records in file1 to parse, match, and extract from file2 in one pass. Yes, Perl can handle regexes with 300,000 alternations! :) This is only reasonable in Perl's whose regex engines can construct alternation trees. (5.10+ You could use Regexp::Assemble before that.)

## YOUR CODE ##
open( FH, 'MAH' ) or die "$!";
while ( $lines = <FH> ) {
    $SSA = substr( $lines, 194, 9 );
    open( FH1, 'MAH2' ) or die "$!";
    while ( $array1 = <FH1> ) {
        @fieldnames = split( /\|/, $array1 );
        $SSA1       = $fieldnames[1];
        $report4    = $fieldnames[0];
        if ( $SSA =~ /$SSA1/ ) {
            $report5= $report4;
        }
    }
}

As regex:

our $file1 = "MAH";
our $file2 = "MAH2";

open our $fh1, "<", $file1 or die $!;
our $ssa_regex = "(?|" . 
    join( "|", 
      map join("", "^([^|]*)[|](", quotemeta($_), ")(?=[|])"), 
      map substr( $_, 194, 9 ), 
      <$fh1> ) .
    ")"
;
close $fh1;

open our $fh2, "<", $file2 or die $!;
our @ssa_matches = do { local $/; <$fh2> =~ m/$ssa_regex/mg; };
close $fh2;
undef $ssa_regex;
die "match array contains an odd number of entries??\n" if @ssa_matches % 2;

while (@ssa_matches) {
  my($report4, $SSA1) = splice @ssa_matches, 0, 2;
  ## do whatever with this information ##

}

Let's break that up with some comments.

Read file1 and build the regex.

our $file1 = "MAH";
our $file2 = "MAH2";

# open file1 as normal
open our $fh1, "<", $file1 or die $!;
# build up a regular expressions that will match all of the SSA fields
our $ssa_regex = 
   # Start the alternation reset group.  This way you always have $1 
   # and $2 regardless of how many groups or total parens there are.
   "(?|" . 
   # Join all the alternations together
    join( "|", 
      # Create one regex group that will match the beginning of the line, 
      # the first "record4" field, the | delimiter, the SSA, and then 
      # make sure the following character is the delimiter.  [|] is 
      # another way to escape the | character that can be more clear 
      # than \|.
      # Escape any weird characters in the SSA with quotemeta(). Omit 
      # this if plain text.
      map join("", "^([^|]*)[|](", quotemeta($_), ")(?=[|])"), 
      # Pull out the SSA value with substr().
      map substr( $_, 194, 9 ), 
      # Read all the lines of file1 and feed them into the map pipeline.
      <$fh1> ) .
    # Add the closing parethesis for the alternation reset group.
    ")"
;
# Close file1.
close $fh1;

Read in file2 and apply the regex.

# Open file2 as normal.
open our $fh2, "<", $file2 or die $!;
# Read all of file2 and apply the regex to get an array of the wanted
# "record4" field and the matching SSA.
our @ssa_matches = 
# Using a do{} block lets do the undef inline.
do { 
# Undefine $/ which is the input record seperator which will let 
# us read the entire file as a single string.
local $/; 
# Read the file as a single string and apply the regex, doing a global 
# multiline match.  /m means to apply the ^ assertion at every line, 
# not just at the beginning of the string.  /g means to perform and 
# return all of the matches at once.
<$fh2> =~ m/$ssa_regex/mg;
};
# Close file2 as normal.
close $fh2;
# Clear the memory for the regex if we don't need it anymore
undef $ssa_regex;

# Make sure we got pairs
die "match array contains an odd number of entries??\n" if @ssa_matches % 2;

# Now just iterate through @ssa_matches two at a time to do whatever
# you wanted to do with the matched SSA values and that "record4" 
# field.  Why is it record4 if it's the first field?
while (@ssa_matches) {
  # Use splice() to pull out and remove the two values from @ssa_matches
  my($report4, $SSA1) = splice @ssa_matches, 0, 2;
  ## do whatever with this information ##

}

The regex could be compacted a bit more if we're being pedantic.

our $ssa_regex = "^([^|]*)[|](" . 
    join( "|", 
      map quotemeta($_), 
      map substr( $_, 194, 9 ), 
      <$fh1> ) .
    ")(?=[|])"
;

I'm not guaranteeing this way is better or faster than any other, but it is a way to do it with fewer steps.

like image 37
lordadmira Avatar answered Oct 16 '22 23:10

lordadmira