
Intersection of files

I have two large files (27k lines and 450k lines). They look something like this:

File1:
1 2 A 5
3 2 B 7
6 3 C 8
...

File2:
4 2 C 5
7 2 B 7
6 8 B 8
7 7 F 9
... 

I want the lines from both files in which the value in the 3rd column appears in both files (note that the lines with A and F were excluded):

OUTPUT:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8

What's the best way to do this?

bdeonovic asked Sep 15 '12


People also ask

How do you find the union of two files in Linux?

To form the union of two files as multisets of lines, just combine them into one file, keeping duplicates; cat (short for "concatenate") does exactly that. To find the union of two files as sets, first form the multiset union, then remove the duplicates.
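As a quick sketch, using two hypothetical sample files a.txt and b.txt:

```shell
# Hypothetical sample files for illustration
printf 'x\ny\n' > a.txt
printf 'y\nz\n' > b.txt

# Multiset union: concatenate, duplicates kept
cat a.txt b.txt > union_multiset.txt

# Set union: sort the concatenation and drop duplicates
sort -u a.txt b.txt > union_set.txt
```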

How do you find the common lines between two files in UNIX?

Use comm -12 file1 file2 to get the lines common to both files. Both files need to be sorted for comm to work as expected. Alternatively, with grep you need the -x option to match whole lines rather than substrings, and the -F option tells grep to treat each pattern as a fixed string instead of a regex.
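Both variants in one sketch, using hypothetical files list1 and list2:

```shell
# Hypothetical unsorted sample files
printf 'pear\napple\n' > list1
printf 'apple\nbanana\n' > list2

# comm requires sorted input
sort list1 > list1.s
sort list2 > list2.s
comm -12 list1.s list2.s > common.txt       # lines present in both

# grep alternative: -F fixed strings, -x whole-line matches
grep -Fxf list1 list2 > common_grep.txt
```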


2 Answers

First we sort both files on the third field:

sort -k 3 file1 > file1.sorted
sort -k 3 file2 > file2.sorted

Then we extract the values of the 3rd field that are common to both files, using comm:

comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field

Now we can join each sorted file against the common values:

join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined

The output is formatted so we get the same field order as in the input files. Standard Unix tools used: sort, comm, cut, uniq, join. The <( ) process substitution works in bash; for other shells you might use temporary files instead.
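Putting the steps together on the question's sample data gives a self-contained sketch (temporary files are used in place of <( ) so it also runs under plain sh):

```shell
# Sample data from the question
printf '1 2 A 5\n3 2 B 7\n6 3 C 8\n' > file1
printf '4 2 C 5\n7 2 B 7\n6 8 B 8\n7 7 F 9\n' > file2

# Sort both files on the join field (field 3)
sort -k 3 file1 > file1.sorted
sort -k 3 file2 > file2.sorted

# comm needs sorted input; cut isolates the key column, uniq removes repeats
cut -d ' ' -f 3 file1.sorted | uniq > f1.keys
cut -d ' ' -f 3 file2.sorted | uniq > f2.keys
comm -12 f1.keys f2.keys > common_values.field

# Join each file on the common keys, restoring the original field order
join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined
```

file.joined then holds the five expected lines, grouped by source file.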

Kwariz answered Sep 28 '22


Here's an option using grep, sed and cut.

Extract column 3:

cut -d' ' -f3 file1 > f1c
cut -d' ' -f3 file2 > f2c

Find matching lines in file1 (-x forces whole-line matches, so a key such as 1 cannot match inside 12):

grep -nxFf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1  > out

Find matching lines in file2:

grep -nxFf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out

Output:

3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
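The whole approach as one runnable sketch on the sample data (with the -x whole-line flag added, an assumption for safety with keys that could be substrings of one another):

```shell
# Sample data from the question
printf '1 2 A 5\n3 2 B 7\n6 3 C 8\n' > file1
printf '4 2 C 5\n7 2 B 7\n6 8 B 8\n7 7 F 9\n' > file2

# Extract the key column from each file
cut -d' ' -f3 file1 > f1c   # A B C
cut -d' ' -f3 file2 > f2c   # C B B F

# grep -n emits matching line numbers; sed turns them into a tiny
# sed script like "2p\n3p", which "sed -n -f -" then runs on the file
grep -nxFf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1  > out
grep -nxFf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out
```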

Update

If you have asymmetric data files and the smaller one fits into memory, this one-pass awk solution would be pretty efficient:

parse.awk

FNR == NR {
  a[$3] = $0
  p[$3] = 1
  next
}  

a[$3]

p[$3] {
  print a[$3]
  delete p[$3]
}

Run it like this:

awk -f parse.awk file1 file2

Where file1 is the smaller of the two.

Explanation

  • The FNR == NR block runs only while reading file1; it saves each line in a and flags its key in p.
  • The bare pattern a[$3] prints the current file2 line when its 3rd field was seen in file1.
  • The p[$3] block prints the saved file1 line for that key, then deletes the key so it is printed only once. Note that a[$3] keeps only the last file1 line per key, so duplicate keys in file1 contribute one line each.
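A quick sanity check of parse.awk on the question's sample data, with the script reproduced via a heredoc:

```shell
# Sample data from the question (file1 is the smaller file, as required)
printf '1 2 A 5\n3 2 B 7\n6 3 C 8\n' > file1
printf '4 2 C 5\n7 2 B 7\n6 8 B 8\n7 7 F 9\n' > file2

cat > parse.awk <<'EOF'
FNR == NR {
  a[$3] = $0
  p[$3] = 1
  next
}
a[$3]
p[$3] {
  print a[$3]
  delete p[$3]
}
EOF

awk -f parse.awk file1 file2 > awk.out
```

The five expected lines come out interleaved (each matching file2 line followed by its file1 counterpart) rather than grouped by file as in the earlier approaches.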
Thor answered Sep 28 '22