
How to find set difference of two files?

I have two files, A and B. I want to find all the lines in A that are not in B. What's the fastest way to do this in bash or with standard Linux utilities? Here's what I've tried so far:

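while IFS= read -r line
 do
   # Count the lines in file2 that match this line exactly; print the line if there are none
   if [ "$(grep -c "^$line$" file2)" -eq 0 ]; then
   echo "$line"
   fi
 done < file1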

It works, but it's slow. Is there a faster way of doing this?

spinlok asked May 07 '12 21:05


2 Answers

The BashFAQ describes doing exactly this with comm, which is the canonically correct method.

# Subtraction of file1 from file2
# (i.e., only the lines unique to file2)
comm -13 <(sort file1) <(sort file2)
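The question asks for the opposite direction: lines in the first file that are not in the second. A minimal sketch of that variant, assuming bash (for process substitution); the helper name `lines_only_in_first` and the sample data are hypothetical and not part of the original answer:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines of file $1 that do not appear in file $2.
lines_only_in_first() {
  # comm requires sorted input. -2 hides lines unique to the second file and
  # -3 hides lines common to both, leaving only lines unique to the first.
  # LC_ALL=C makes sort and comm agree on a single collation order.
  LC_ALL=C comm -23 <(LC_ALL=C sort "$1") <(LC_ALL=C sort "$2")
}

# Example with throwaway sample data
tmp=$(mktemp -d)
printf '%s\n' cherry apple banana > "$tmp/A"
printf '%s\n' banana date > "$tmp/B"
lines_only_in_first "$tmp/A" "$tmp/B"   # prints "apple" and "cherry", one per line
rm -rf "$tmp"
```

Swapping the two arguments gives the same result as the `comm -13` form above.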

diff is less appropriate for this task, as it tries to operate on blocks rather than individual lines; as such, the algorithms it has to use are more complex and less memory-efficient.

comm has been part of the Single Unix Specification since SUS2 (1997).

Charles Duffy answered Oct 24 '22 09:10


If you simply want lines that are in file A, but not in B, you can sort the files, and compare them with diff.

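sort A > A.sorted
sort B > B.sorted
# Lines only in A start with "-"; tail skips the two-line ---/+++ header, which would otherwise match too
diff -u A.sorted B.sorted | tail -n +3 | grep '^-'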
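To see what this approach yields on concrete input, here is a minimal sketch on throwaway sample files. In unified diff output, lines found only in the first file begin with "-", but so does the "---" file header, so the sketch skips the two header lines and strips the marker with sed. The helper name `diff_only_in_first` and the sample data are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print lines present in sorted file $1 but absent from sorted file $2.
diff_only_in_first() {
  # diff exits with status 1 when the files differ; that is expected here.
  # sed: from line 3 onward (past the ---/+++ header), print the lines that
  # start with "-" with that marker removed.
  diff -u "$1" "$2" | sed -n '3,$ s/^-//p'
}

# Example with throwaway sample data
tmp=$(mktemp -d)
printf '%s\n' cherry apple banana | sort > "$tmp/A.sorted"
printf '%s\n' date banana | sort > "$tmp/B.sorted"
diff_only_in_first "$tmp/A.sorted" "$tmp/B.sorted"   # prints "apple" and "cherry", one per line
rm -rf "$tmp"
```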
tonio answered Oct 24 '22 09:10