I am trying to merge two tab delimited files files - which are of unequal lengths. I need to merge the files based on column number 1 and get the values from the 3rd column of each file to the new file. If any of the files is missing any id ( uncommon value) then it should get a blank value in the new file -
File1:
id1 2199 082
id2 0909 20909
id3 8002 8030
id4 28080 80828
File2:
id1 988 00808
id2 808 80808
id4 8080 2525
id6 838 3800
Merged file :
id1 082 00808
id2 20909 80808
id3 8030
id4 80828 2525
id6 3800
I went through many forums and posts and so far I have this
awk -F\t 'NR==FNR{A[$1]=$1; B[$1]=$1; next} {$2=A[$1]; $3=B[$1]}1'
but it does not yield the right result, can anyone please suggest. thanks a lot!
$ awk -F'\t' 'NR==FNR{A[$1]=$3; next} {A[$1]; B[$1]=$3} END{for (id in A) print id,A[id],B[id]}' OFS='\t' File1 File2 | sort
id1 082 00808
id2 20909 80808
id3 8030
id4 80828 2525
id6 3800
This script uses two variables. For every line in File1, associative array A
has a key corresponding to the id and the value of the third field. For every id in File2, A
also has a key (but not necessarily a value). For File2, array B
has a key for every id with the corresponding value from the third column.
-F'\t'
This sets the field separator on input to a tab. Note that \t
must be quoted to protect it from the shell.
NR==FNR{A[$1]=$3; next}
This sets the associative array A
for the first file.
A[$1]; B[$1]=$3
This sets associative array for the second file. It also makes sure that the array A
has a key for every id in file2.
END{for (id in A) print id,A[id],B[id]}
This prints out the results.
OFS='\t'
This sets the output field separator to a tab.
sort
The awk construct for key in array
is not guaranteed to return the keys in any particular order. We use sort
to sort the output into ascending order in the id.
Assuming no repeated IDs, you could try (Bash used at -F$'\t'
):
awk -F$'\t' 'BEGIN { OFS=FS } NR==FNR { A[$1] = $3; next }
{ if ($1 in A) print $1, A[$1], $3
else print $1, " ", $3
delete A[$1]
}
END { for (i in A) print i, A[i], " " }
' File1 File2
Set the output field separator. For lines in the first file, capture field 3 of the first file in the array A
indexed by column 1. For lines in the second (or subsequent) files, if the ID column is found in A
, print the three columns; otherwise, print a blank in place of the missing entry in A
. Delete the entries in A
after they've been used. At the end, any rows left over are printed with a blank for the missing entry in the second file.
For the given data, an example of the output is:
id1 082 00808
id2 20909 80808
id4 80828 2525
id6 3800
id3 8030
Clearly, if you want the data sorted in a particular way, you can post-filter the awk
command with the sort
command (meaning, pipe the output from awk
to sort
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With