 

Fast way of finding lines in one file that are not in another?

Tags: grep, find, bash, diff

I have two large files (sets of filenames), with roughly 30,000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.
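One likely reason for the slowness is that grep treats every line of file2 as a regular expression. A minimal sketch, assuming the lines are meant to be compared literally and in full: adding -F (fixed strings) and -x (whole-line match) is usually much faster, and it also avoids false matches on substrings or on characters such as "." that are special in a regex. The helper name grep_only_in_first is made up for this illustration.

```shell
# Hypothetical helper: print lines of "$1" that do not occur in "$2".
# -F  treat each pattern as a literal string, not a regex
# -x  a pattern must match the entire line
# -v  invert the match (keep the lines that did NOT match)
# -f  read the patterns from a file
grep_only_in_first() {
    grep -F -x -v -f "$2" -- "$1"
}

# Demo with the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' line1 line2 line3 > "$workdir/file1"
printf '%s\n' line1 line4 line5 > "$workdir/file2"
grep_only_in_first "$workdir/file1" "$workdir/file2"   # prints line2 and line3
rm -rf "$workdir"
```

The output keeps the original order of file1, and neither file needs to be sorted.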

I suspect there is a good way to do this using diff(), but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff():

 diff file2 file1 | grep '^>' | sed 's/^>\ //' 

Surely, there must be a better way?
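One alternative that avoids parsing diff output is a single awk pass that loads file2 into an associative array and then filters file1 against it. This is only a sketch, not the method from the answer below; it assumes file2 fits in memory (which should hold for roughly 30,000 lines), and the helper name awk_only_in_first is invented here.

```shell
# Hypothetical helper: print lines of "$1" that do not occur in "$2".
awk_only_in_first() {
    # While reading the first argument given to awk (file "$2"), remember
    # each line as an array key. While reading the other file ("$1"),
    # print only the lines that were never remembered.
    # FILENAME == ARGV[1] is used instead of the common NR == FNR idiom
    # so that an empty "$2" is still handled correctly.
    awk 'FILENAME == ARGV[1] { seen[$0]; next } !($0 in seen)' "$2" "$1"
}

# Demo with the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' line1 line2 line3 > "$workdir/file1"
printf '%s\n' line1 line4 line5 > "$workdir/file2"
awk_only_in_first "$workdir/file1" "$workdir/file2"   # prints line2 and line3
rm -rf "$workdir"
```

Unlike diff, this does not depend on line order: a line is dropped if it appears anywhere in file2, and surviving lines keep their order (and duplicates) from file1.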

Niels2000 asked Aug 13 '13 09:08


People also ask

How do I find the common line between two files?

Use comm -12 file1 file2 to get the lines common to both files. Both files need to be sorted for comm to work as expected. Alternatively, use grep with the -x option to match whole lines only and the -F option to treat each pattern as a fixed string rather than a regular expression.
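Both approaches can be sketched as small shell helpers. The function names are hypothetical, and the comm variant sorts scratch copies of the inputs first because comm assumes sorted files.

```shell
# Hypothetical helper: common lines via grep, in the order they appear in "$1".
common_lines_grep() {
    grep -F -x -f "$2" -- "$1"
}

# Hypothetical helper: common lines via comm, in sorted order.
common_lines_comm() {
    common_tmp=$(mktemp -d) || return 1
    # sort and comm must agree on the collation order, so pin the locale.
    LC_ALL=C sort "$1" > "$common_tmp/a"
    LC_ALL=C sort "$2" > "$common_tmp/b"
    # -1 and -2 suppress the lines unique to each file, leaving common lines.
    LC_ALL=C comm -12 "$common_tmp/a" "$common_tmp/b"
    rm -rf "$common_tmp"
}

# Demo with the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' line1 line2 line3 > "$workdir/file1"
printf '%s\n' line1 line4 line5 > "$workdir/file2"
common_lines_grep "$workdir/file1" "$workdir/file2"   # prints line1
common_lines_comm "$workdir/file1" "$workdir/file2"   # prints line1
rm -rf "$workdir"
```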

How do I grep certain lines in a file?

The grep command searches a file for lines that match a given pattern. To use it, type grep, then the pattern to search for, and finally the name of the file (or files) to search. For example, searching for 'not' outputs the lines in the file that contain the letters 'not'.

How can you tell that two files are different from each other?

Probably the easiest way to compare two files is to use the diff command. The output will show you the differences between the two files. The < and > signs indicate whether the extra lines are in the first (<) or second (>) file provided as arguments.
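As a small illustration of that output format, the sketch below runs diff on the two sample files from the question. The scratch directory and the variable name diff_output exist only for this example.

```shell
workdir=$(mktemp -d)
printf '%s\n' line1 line2 line3 > "$workdir/file1"
printf '%s\n' line1 line4 line5 > "$workdir/file2"

# diff exits with status 1 when the files differ; "|| true" keeps that
# from aborting a script that runs under "set -e".
diff_output=$(diff "$workdir/file1" "$workdir/file2" || true)
printf '%s\n' "$diff_output"
# Prints:
# 2,3c2,3
# < line2
# < line3
# ---
# > line4
# > line5

rm -rf "$workdir"
```

The "2,3c2,3" header says that lines 2 to 3 of the first file were changed into lines 2 to 3 of the second; the "<" lines come from the first file and the ">" lines from the second.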


1 Answer

The comm command (short for "common") may be useful. From its man page: "comm - compare two sorted files line by line". Note that both files must already be sorted for the results to be reliable.

# find lines only in file1
comm -23 file1 file2

# find lines only in file2
comm -13 file1 file2

# find lines common to both files
comm -12 file1 file2

The man page is actually quite readable for this.
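Since comm expects sorted input and the files in the question are not necessarily sorted, one way to apply it is to sort scratch copies first. The following is a sketch under that assumption; the helper name comm_only_in_first is made up, and the output comes out in sorted order rather than in the original order of file1.

```shell
# Hypothetical helper: print lines that appear in "$1" but not in "$2".
comm_only_in_first() {
    comm_tmp=$(mktemp -d) || return 1
    # sort and comm must use the same collation order; LC_ALL=C pins it
    # to plain byte order, which is also typically the fastest.
    LC_ALL=C sort "$1" > "$comm_tmp/a"
    LC_ALL=C sort "$2" > "$comm_tmp/b"
    # -2 drops lines unique to the second file, -3 drops lines common to both.
    LC_ALL=C comm -23 "$comm_tmp/a" "$comm_tmp/b"
    rm -rf "$comm_tmp"
}

# Demo with the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' line1 line2 line3 > "$workdir/file1"
printf '%s\n' line1 line4 line5 > "$workdir/file2"
comm_only_in_first "$workdir/file1" "$workdir/file2"   # prints line2 and line3
rm -rf "$workdir"
```

In bash specifically, the scratch files can be avoided with process substitution, as in comm -23 <(sort file1) <(sort file2); the version above sticks to plain temporary files so it also works in a POSIX sh.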

JnBrymn answered Nov 01 '22 20:11