I have two (very large) text files. What is the fastest way - in terms of run time - to create a third file containing all lines of file1 that do not appear in file2?
So if file1 contains:
Sally
Joe
Tom
Suzie
And file2 contains:
Sally
Suzie
Harry
Tom
Then the output file should contain:
Joe
Create a hashmap containing each line of file 2 as a key. Then, for each line in file 1, output it if it is not in the hashmap. This is O(N) in the total number of lines, which is the best efficiency class you can achieve, since you have to read all of the input anyway.
Perl implementation:
#!/usr/bin/env perl
use warnings;
use strict;
use Carp ();

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';

# Build a hash keyed on every line of file 2.
my %map;
{
    open my $in, '<', $file2 or Carp::croak("Can't open $file2: $!");
    while (<$in>) {
        $map{$_} = 1;
    }
    close($in) or Carp::carp("error closing $file2");
}

# Print every line of file 1 that does not appear in the hash.
{
    open my $in, '<', $file1 or Carp::croak("Can't open $file1: $!");
    while (<$in>) {
        print $_ if !$map{$_};
    }
    close($in) or Carp::carp("error closing $file1");
}
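If the script is saved as, say, filter.pl (the name is arbitrary), it reads the hard-coded file1.txt and file2.txt and writes the surviving lines to standard output, so the third file can be produced with a redirect:

perl filter.pl > file3.txt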
If file 2 is so large that the hashmap doesn't fit in memory, we have a different problem. One possible approach is to apply the solution above to chunks of file 2, each small enough to fit in memory, writing the results to temporary files. Provided there are enough matches between file 1 and file 2, the total output should stay a reasonable size. To compute the final result, we then intersect the lines of the temporary files: a line belongs in the final result only if it occurs in every temporary file (i.e. it matched no chunk of file 2). A sketch of this is given below.
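A rough Perl sketch of that chunked approach might look like the following. The file names, the chunk size of one million lines, and the use of File::Temp for the intermediate files are illustrative assumptions, not part of the original answer; it also assumes the distinct lines of a single temporary file fit in memory for the intersection step, and that file 2 is non-empty.

#!/usr/bin/env perl
use warnings;
use strict;
use File::Temp ();

# Hypothetical file names and chunk size -- adjust for the real data.
my $file1      = 'file1.txt';
my $file2      = 'file2.txt';
my $chunk_size = 1_000_000;      # lines of file 2 hashed at a time

my @temp_files;

# Pass 1: read file 2 in chunks. For each chunk, write every line of
# file 1 that is NOT in that chunk to its own temporary file.
open my $f2, '<', $file2 or die "Can't open $file2: $!";
while (!eof($f2)) {
    my %chunk;
    my $lines = 0;
    while ($lines < $chunk_size && defined(my $line = <$f2>)) {
        $chunk{$line} = 1;
        $lines++;
    }

    my $tmp = File::Temp->new(UNLINK => 0);
    open my $f1, '<', $file1 or die "Can't open $file1: $!";
    while (<$f1>) {
        print {$tmp} $_ if !$chunk{$_};
    }
    close $f1;
    close $tmp;
    push @temp_files, $tmp->filename;
}
close $f2;

# Pass 2: intersect the temporary files. A line belongs in the final
# result only if it appears in every temporary file, i.e. it matched
# no chunk of file 2 at all.
my %candidates;
my $first = 1;
for my $tf (@temp_files) {
    my %seen;
    open my $in, '<', $tf or die "Can't open $tf: $!";
    $seen{$_} = 1 while <$in>;
    close $in;
    unlink $tf;

    if ($first) {
        %candidates = %seen;
        $first = 0;
    }
    else {
        # drop any candidate that is missing from this temporary file
        delete @candidates{ grep { !$seen{$_} } keys %candidates };
    }
}

# Final pass over file 1 so the output keeps its original order.
open my $f1, '<', $file1 or die "Can't open $file1: $!";
while (<$f1>) {
    print $_ if $candidates{$_};
}
close $f1;

Keeping only the progressive intersection in memory means the peak memory use is bounded by one chunk of file 2 or one temporary file's distinct lines, whichever is larger, rather than by the whole of either input file.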