Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the fastest way in Perl to get all lines of file1 that do not appear in file2?

Tags:

algorithm

perl

I have two (very large) text files. What is the fastest way - in terms of run time - to create a third file containing all lines of file1 that do not appear in file2?

So if file1 contains:

Sally  
Joe  
Tom  
Suzie

And file2 contains:

Sally  
Suzie  
Harry  
Tom

Then the output file should contain:

Joe
like image 386
Rini Avatar asked Nov 28 '22 05:11

Rini


1 Answers

Create a hashmap containing each line from file 2. Then for each line in file 1, if it is not in the hashmap then output it. This will be O(N), which is the best efficiency class you can achieve given that you have to read the input.

Perl implementation:

#!/usr/bin/env perl
use warnings;
use strict;
use Carp ();

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';

my %map;
{
    open my $in, '<',$file2 or Carp::croak("Cant open $file2");
    while (<$in>) {
      $map{$_} = 1;
    }
    close($in) or Carp::carp("error closing $file2");
}
{
   open my $in,'<', $file1 or Carp::croak("Cant open $file1");
   while (<$in>) {
    if (!$map{$_}) {
      print $_;
    }
   }
   close $in or Carp::carp("error closing $file1");
}

If file 2 is so large that the hashmap doesn't fit in memory, then we have a different problem at hand. A possible solution is then to use the above solution on chunks of file 2 (small enough to fit into memory), outputing the results to temporary files. Provided there are sufficient matches between file 1 and file 2, then total output should be of reasonable size. To calculate the final results, we perform an intersection of the lines in temporary files, i.e. for a line to be in the final results it must occur in every temporary file.

like image 125
moinudin Avatar answered May 26 '23 23:05

moinudin