I have two large files (27k lines and 450k lines). They look sort of like:
File1:
1 2 A 5
3 2 B 7
6 3 C 8
...
File2:
4 2 C 5
7 2 B 7
6 8 B 8
7 7 F 9
...
I want the lines from both files whose 3rd-column value appears in both files (note that the lines with A and F were excluded):
OUTPUT:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
What's the best way?
To form the union of two files as multisets of lines, just combine them into one file, with duplicates. You can join file1 and file2 with cat (short for “concatenate”). To find the union of two files as sets, first find the union as multisets, then remove duplicates.
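A minimal sketch of both operations (the output name union.txt is arbitrary):

# union as a multiset: concatenate, keeping duplicate lines
cat file1 file2 > union.txt

# union as a set: concatenate, then drop duplicate lines
cat file1 file2 | sort -u > union.txt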
Use comm -12 file1 file2 to get the lines common to both files. Note that comm needs its input files to be sorted to work as expected. Alternatively, with grep you need to add the -x option to match whole lines; the -F option tells grep to treat each pattern as a fixed string, not a regex.
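For example (a sketch; note both variants compare whole lines, not just the 3rd column, so they don't answer this question by themselves):

# comm wants sorted input; -1 and -2 suppress lines unique to either file
comm -12 <(sort file1) <(sort file2)

# grep equivalent: -F fixed strings, -x whole-line match, patterns read from file2
grep -Fxf file2 file1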
First we sort the files on the third field:
sort -k 3 file1 > file1.sorted
sort -k 3 file2 > file2.sorted
Then we get the common values of the 3rd field using comm:
comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field
Now we can join each sorted file on the common values:
join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined
The output is formatted so we get the same field order as the one used in the input files.
Standard Unix tools used: sort, comm, cut, uniq, join.
The <( ) process substitution works with bash; for other shells you might use temp files instead.
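For example, a sketch of the comm step above that works in a plain POSIX shell, using temp files (the names f1.values and f2.values are arbitrary):

cut -d " " -f 3 file1.sorted | uniq > f1.values
cut -d " " -f 3 file2.sorted | uniq > f2.values
comm -12 f1.values f2.values > common_values.field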
Here's an option using grep, sed and cut.
Extract column 3:
cut -d' ' -f3 file1 > f1c
cut -d' ' -f3 file2 > f2c
Find matching lines in file1:
grep -nFxf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1 > out
Find matching lines in file2:
grep -nFxf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out
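Step by step, here is the file1 pipeline again with each stage commented (same commands, just split across lines):

grep -nFxf f2c f1c |    # line numbers of f1c values that also occur in f2c ("N:value")
cut -d: -f1 |           # keep only the line number N
sed 's/$/p/' |          # turn each N into the sed command "Np"
sed -n -f - file1 > out # run those commands as a sed script against file1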
Output:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
If you have asymmetric data files and the smaller one fits into memory, this one-pass awk solution would be pretty efficient:
parse.awk
FNR == NR {       # first file only: index it by the 3rd column
  a[$3] = $0      # remember the file1 line for this key
  p[$3] = 1       # flag the key as present in file1
  next
}
a[$3]             # print the current file2 line if its $3 was seen in file1
p[$3] {           # first time this key appears in file2 ...
  print a[$3]     # ... also print the stored file1 line
  delete p[$3]    # ... but only once
}
Run it like this:
awk -f parse.awk file1 file2
Where file1 is the smaller of the two.
Explanation
- The FNR == NR block reads file1 into the two hashes.
- a[$3] prints the file2 line if $3 is a key in a.
- p[$3] prints the stored file1 line if $3 is a key in p, and deletes the key (so it is only printed once).