
Efficient way of comparing two files and removing partial match

Tags: file, linux, awk, perl

I have two files, for example:

File1:

partial
line3
someline2

File2:

this is line3
this is partial
typo artial
someline2
someline

Requirements:

  1. Delete every line from file2 that contains any line from file1.
  2. The match is a substring match: a line from file1 only has to appear somewhere within a line of file2, it does not have to equal the whole line.
  3. I am looking for the most efficient way, since I am comparing files with millions of lines.
  4. Any tool or language available on Linux is fine.

Expected result:

typo artial
someline

I tested with Python, but it is extremely slow. I also tested with grep, and it is nearly as slow as Python.

The files I am comparing can be up to 10 GB in size. Memory on the server is not an issue, but I would rather not waste resources.

Testing results based on answers:
Files used for testing:

  • file1 with 7051 lines
  • file2 with 2182387 lines

Using grep:

# time grep -v -f file1 file2 > file3
real    28m50.078s
user    27m13.984s
sys     1m36.068s
# wc -l file3
1947790 file3

Grep with -F:

# time grep -v -F -f file1 file2 > file3
real    0m1.441s
user    0m1.400s
sys     0m0.040s
# wc -l file3
1950655 file3
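
The two runs leave different line counts (1947790 without -F, 1950655 with it). A likely reason, not verified against these files, is that without -F every line of file1 is interpreted as a regular expression, so characters such as "." match more than themselves. A minimal sketch of that effect, using hypothetical demo files rather than the data above:

```shell
# Hypothetical demo files, not the files from the timings above.
tmp=$(mktemp -d)
printf '%s\n' 'a.c' > "$tmp/patterns"
printf '%s\n' 'abc' 'a.c' 'xyz' > "$tmp/data"

# Without -F the pattern is a regex: "." matches any character,
# so both "abc" and "a.c" are removed.
regex_kept=$(grep -v -f "$tmp/patterns" "$tmp/data")

# With -F the pattern is a literal string: only "a.c" is removed.
fixed_kept=$(grep -v -F -f "$tmp/patterns" "$tmp/data")

echo "kept without -F:"; printf '%s\n' "$regex_kept"
echo "kept with -F:";    printf '%s\n' "$fixed_kept"

rm -rf "$tmp"
```

If file1 holds plain strings, the -F result is the one that matches the requirement, and the regex run removes some lines it should keep.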

Using the Perl script posted by Borodin:

# time ./clean.pl > file3
real    0m2.281s
user    0m2.176s
sys     0m0.104s
# wc -l file3
1950655 file3

To be honest, I did not expect fixed strings to make such a big difference for grep. So far grep wins. I still have to test with 10 GB files and time it, after making sure the results are correct. I will be back with an update.

Update

Perl wins this one, since I had to introduce regexes for some special cases. For instance, I have a big file of domains that I want to exclude from another file. That means I need domain$ as a regex, because otherwise google.co would also match google.com, which is wrong. If you do not have that special case (I had it for some files only), grep is the obvious performance winner.
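
For reference, the domain$ case can also be sketched with grep alone by escaping each domain and appending "$". This is only an illustration under assumptions: the file names and sample hosts are hypothetical, the sed escaping targets basic regular expressions, and it falls back to regex matching, which was the slow path in the timings above.

```shell
# Hypothetical sample data for the "domain at end of line" case.
tmp=$(mktemp -d)
printf '%s\n' 'google.co' > "$tmp/domains"
printf '%s\n' 'mail.google.co' 'www.google.com' 'googlexco' > "$tmp/hosts"

# Escape the basic-regex metacharacters ] [ \ . * ^ $ in every domain,
# then append "$" so the domain must sit at the end of the line.
sed -e 's/[][\.*^$]/\\&/g' -e 's/$/$/' "$tmp/domains" > "$tmp/patterns"

# Keep only the lines that do not end with one of the domains.
anchored_kept=$(LC_ALL=C grep -v -f "$tmp/patterns" "$tmp/hosts")
printf '%s\n' "$anchored_kept"

rm -rf "$tmp"
```

Here mail.google.co is removed while www.google.com and googlexco are kept. Note that the anchor only fixes the end of the line, so a host such as notgoogle.co would still be removed.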

zerg asked Jan 20 '26 11:01



2 Answers

I would use grep, which is available on any Linux system.

command

grep -v -f File1 File2

-v : select non-matching lines

-f : obtain PATTERN from FILE

You need to run the above command in a terminal.

output

typo artial
someline
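
Since the lines of File1 in the question are plain strings, adding -F (fixed strings) should give the same output here, and the asker's timings show it running far faster. A minimal sketch that recreates the sample files from the question in a temporary directory (the directory and variable names are illustrative):

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'partial' 'line3' 'someline2' > "$tmp/File1"
printf '%s\n' 'this is line3' 'this is partial' 'typo artial' \
    'someline2' 'someline' > "$tmp/File2"

# -F: treat each line of File1 as a literal string, not a regex
# -v: print only the lines of File2 that contain none of them
result=$(grep -v -F -f "$tmp/File1" "$tmp/File2")
printf '%s\n' "$result"

rm -rf "$tmp"
```

One thing to watch for: an empty line in File1 matches every line of File2, so it may be worth removing blank lines from File1 first.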
Arijit Panda answered Jan 23 '26 02:01



The simplest way is to build a single regex pattern from all of the strings in file1.txt and print only those lines of file2.txt that don't match the pattern.

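use strict;
use warnings 'all';

# Build one alternation regex from the lines of file1.txt
my $re = do {
    open my $fh, '<', 'file1.txt' or die $!;
    my @data = <$fh>;
    chomp @data;
    # quotemeta escapes regex metacharacters so each line matches literally
    my $re = join '|', map quotemeta($_), @data;
    qr/$re/;
};

# Print every line of file2.txt that contains none of the strings
open my $fh, '<', 'file2.txt' or die $!;
/$re/ or print while <$fh>;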

output

typo artial
someline
Borodin answered Jan 23 '26 02:01



