Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge two files based on one column Awk

Tags:

awk

I am trying to merge two tab delimited files files - which are of unequal lengths. I need to merge the files based on column number 1 and get the values from the 3rd column of each file to the new file. If any of the files is missing any id ( uncommon value) then it should get a blank value in the new file -

File1: 
id1 2199 082
id2 0909 20909
id3 8002 8030
id4 28080 80828

File2:

id1 988 00808
id2 808 80808
id4 8080 2525
id6 838 3800

Merged file :

id1 082 00808
id2 20909 80808
id3 8030  
id4 80828 2525
id6   3800

I went through many forums and posts and so far I have this

awk -F\t 'NR==FNR{A[$1]=$1; B[$1]=$1; next} {$2=A[$1]; $3=B[$1]}1'

but it does not yield the right result, can anyone please suggest. thanks a lot!

like image 582
AnkP Avatar asked Dec 08 '22 22:12

AnkP


2 Answers

$ awk -F'\t' 'NR==FNR{A[$1]=$3; next} {A[$1]; B[$1]=$3} END{for (id in A) print id,A[id],B[id]}' OFS='\t' File1 File2 | sort
id1     082     00808
id2     20909   80808
id3     8030
id4     80828   2525
id6             3800

How it works

This script uses two variables. For every line in File1, associative array A has a key corresponding to the id and the value of the third field. For every id in File2, A also has a key (but not necessarily a value). For File2, array B has a key for every id with the corresponding value from the third column.

  • -F'\t'

    This sets the field separator on input to a tab. Note that \t must be quoted to protect it from the shell.

  • NR==FNR{A[$1]=$3; next}

    This sets the associative array A for the first file.

  • A[$1]; B[$1]=$3

    This sets associative array for the second file. It also makes sure that the array A has a key for every id in file2.

  • END{for (id in A) print id,A[id],B[id]}

    This prints out the results.

  • OFS='\t'

    This sets the output field separator to a tab.

  • sort

    The awk construct for key in array is not guaranteed to return the keys in any particular order. We use sort to sort the output into ascending order in the id.

like image 78
John1024 Avatar answered Dec 10 '22 11:12

John1024


Assuming no repeated IDs, you could try (Bash used at -F$'\t'):

awk -F$'\t' 'BEGIN { OFS=FS } NR==FNR { A[$1] = $3; next }
             { if ($1 in A) print $1, A[$1], $3
               else         print $1, " ", $3
               delete A[$1]
             }
             END { for (i in A) print i, A[i], " " }
            ' File1 File2

Set the output field separator. For lines in the first file, capture field 3 of the first file in the array A indexed by column 1. For lines in the second (or subsequent) files, if the ID column is found in A, print the three columns; otherwise, print a blank in place of the missing entry in A. Delete the entries in A after they've been used. At the end, any rows left over are printed with a blank for the missing entry in the second file.

For the given data, an example of the output is:

id1     082     00808
id2     20909   80808
id4     80828   2525
id6             3800
id3     8030

Clearly, if you want the data sorted in a particular way, you can post-filter the awk command with the sort command (meaning, pipe the output from awk to sort).

like image 25
Jonathan Leffler Avatar answered Dec 10 '22 12:12

Jonathan Leffler