I have two files I'm trying to join/merge based on columns 1
and 2
. They look something like this, with file1
(58210
lines) being much shorter than file2
(815530
lines) and I'd like to find the intersection of these two files based on fields 1
and 2
as an index:
file1
:
2L 25753 33158
2L 28813 33158
2L 31003 33158
2L 31077 33161
2L 31279 33161
3L 32124 45339
3L 33256 45339
...
file2
:
2L 20242 0.5 0.307692307692308
2L 22141 0.32258064516129 0.692307692307692
2L 24439 0.413793103448276 0.625
2L 24710 0.371428571428571 0.631578947368421
2L 25753 0.967741935483871 0.869565217391304
2L 28813 0.181818181818182 0.692307692307692
2L 31003 0.36 0.666666666666667
2L 31077 0.611111111111111 0.931034482758621
2L 31279 0.75 1
3L 32124 0.558823529411765 0.857142857142857
3L 33256 0.769230769230769 0.90625
...
I've been using the following couple of commands but end up with different numbers of lines:
awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file1 file2 | wc -l
awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file2 file1 | wc -l
I'm not sure why this happens, and I've tried sorting prior to comparison, just in case I have duplicate lines (based on columns 1
and 2
) in either of the files, but it doesn't seem to help. (Any insights on why this is so are also appreciated)
How can I just merge the files so that just the lines of file2
that have the corresponding columns 1
and 2
in file1
get printed, with column 3
of file1
added on, to look something like this:
2L 25753 0.967741935483871 0.869565217391304 33158
2L 28813 0.181818181818182 0.692307692307692 33158
2L 31003 0.36 0.666666666666667 33158
2L 31077 0.611111111111111 0.931034482758621 33161
2L 31279 0.75 1 33161
3L 32124 0.558823529411765 0.857142857142857 45339
3L 33256 0.769230769230769 0.90625 45339
NOTE : When using join command, both the input files should be sorted on the KEY on which we are going to join the files. So, the output contains the key followed by all the matching columns from the first file file1. txt, followed by all the columns of second file file2.
Using wc to get the column width, and sed to right pad with a visible character . (just for this example)... and then paste to join the two columns with a Tab char... If you want to pad out the right column: paste <( sed -re :a -e 's/^.
Type the cat command followed by the file or files you want to add to the end of an existing file. Then, type two output redirection symbols ( >> ) followed by the name of the existing file you want to add to.
The join command reads the files specified by the File1 and File2 parameters, joins lines in the files according to the flags, and writes the results to standard output. The File1 and File2 parameters must be text files.
awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
Look:
$ cat file1
2L 5753 33158
2L 8813 33158
2L 7885 33159
2L 1279 33159
2L 5095 33158
$
$ cat file2
2L 8813 0.6 1.2
2L 5762 0.4 0.5
2L 1279 0.5 0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L 8813 0.6 1.2 33158
2L 1279 0.5 0.9 33159
$
If that's not what you want, please clarify and perhaps post some more representative sample input/output.
Commented version of the above code to provide requested explanation:
awk ' # START SCRIPT
# IF the number of records read so far across all files is equal
# to the number of records read so far in the current file, a
# condition which can only be true for the first file read, THEN
NR==FNR {
# populate array "a" such that the value indexed by the first
# 2 fields from this record in file1 is the value of the third
# field from the first file.
a[$1,$2]=$3
# Move on to the next record so we don't do any processing intended
# for records from the second file. This is like an "else" for the
# NR==FNR condition.
next
} # END THEN
# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.
# IF the array index constructed from the first 2 fields of the current
# record exist in array a, as would occur if these same values existed
# in file1, THEN
($1,$2) in a {
# print the current record from file2 followed by the value from file1
# that occurred at field 3 of the record that had the same values for
# field 1 and field 2 in file1 as the current record from file2.
print $0, a[$1,$2]
} # END THEN
' file1 file2 # END SCRIPT
Hope that helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With