properly join two files based on 2 columns in common

Q: How do I combine the contents of two files in Linux?

Type the cat command followed by the file or files you want to add to the end of an existing file. Then, type two output redirection symbols ( >> ) followed by the name of the existing file you want to add to.

Tags:

join

unix

awk

I have two files I'm trying to join/merge based on columns 1 and 2. They look something like this, with file1 (58210 lines) being much shorter than file2 (815530 lines) and I'd like to find the intersection of these two files based on fields 1 and 2 as an index:

file1:

2L      25753   33158
2L      28813   33158
2L      31003   33158
2L      31077   33161
2L      31279   33161
3L      32124   45339
3L      33256   45339
...

file2:

2L      20242   0.5     0.307692307692308
2L      22141   0.32258064516129        0.692307692307692
2L      24439   0.413793103448276       0.625
2L      24710   0.371428571428571       0.631578947368421
2L      25753   0.967741935483871       0.869565217391304
2L      28813   0.181818181818182       0.692307692307692
2L      31003   0.36    0.666666666666667
2L      31077   0.611111111111111       0.931034482758621
2L      31279   0.75    1
3L      32124   0.558823529411765       0.857142857142857
3L      33256   0.769230769230769       0.90625
...

I've been using the following couple of commands but end up with different numbers of lines:

awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file1 file2 | wc -l
awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file2 file1 | wc -l

I'm not sure why this happens, and I've tried sorting prior to comparison, just in case I have duplicate lines (based on columns 1 and 2) in either of the files, but it doesn't seem to help. (Any insights on why this is so are also appreciated)

How can I just merge the files so that just the lines of file2 that have the corresponding columns 1 and 2 in file1 get printed, with column 3 of file1 added on, to look something like this:

2L      25753   0.967741935483871       0.869565217391304    33158
2L      28813   0.181818181818182       0.692307692307692    33158
2L      31003   0.36    0.666666666666667    33158
2L      31077   0.611111111111111       0.931034482758621    33161
2L      31279   0.75    1    33161
3L      32124   0.558823529411765       0.857142857142857    45339
3L      33256   0.769230769230769       0.90625    45339

618

asked Oct 21 '12 02:10

suegene

1 Answers

awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

Look:

$ cat file1
2L      5753   33158
2L      8813   33158
2L      7885   33159
2L      1279   33159
2L      5095   33158
$
$ cat file2
2L      8813    0.6    1.2
2L      5762    0.4    0.5
2L      1279    0.5    0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L      8813    0.6    1.2 33158
2L      1279    0.5    0.9 33159
$

If that's not what you want, please clarify and perhaps post some more representative sample input/output.

Commented version of the above code to provide requested explanation:

awk ' # START SCRIPT

# IF the number of records read so far across all files is equal
#    to the number of records read so far in the current file, a
#    condition which can only be true for the first file read, THEN 
NR==FNR {

   # populate array "a" such that the value indexed by the first
   # 2 fields from this record in file1 is the value of the third
   # field from the first file.
   a[$1,$2]=$3

   # Move on to the next record so we don't do any processing intended
   # for records from the second file. This is like an "else" for the
   # NR==FNR condition.
   next

} # END THEN

# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.

# IF the array index constructed from the first 2 fields of the current
#    record exist in array a, as would occur if these same values existed
#    in file1, THEN
($1,$2) in a {

   # print the current record from file2 followed by the value from file1
   # that occurred at field 3 of the record that had the same values for
   # field 1 and field 2 in file1 as the current record from file2.
   print $0, a[$1,$2]

} # END THEN

' file1 file2 # END SCRIPT

Hope that helps.

175

answered Sep 28 '22 02:09

Ed Morton

Related questions
                            
                                What's the difference between ${varname} and $varname in a shell scripts
                            
                                What is the best unix compression utility? [closed]
                            
                                unix map function
                            
                                Check running processes in C
                            
                                UNIX 'ls' command - wildcard 'OR' operator
                            
                                Unix sockets slower than tcp when connected to redis
                            
                                How to substitute without creating intermediate file in sed?
                            
                                How to remove non-ascii chars using sed
                            
                                AngularJS wrong unix time parse result
                            
                                Get file inode in Go
                            
                                Regarding Hard Link
                            
                                JDBC MySQL connection using Unix Socket
                            
                                How do I convert LF to CRLF?
                            
                                How to split a huge csv file based on content of first column?
                            
                                Parallel processing or threading in Shell scripting
                            
                                Finding contents of one file in another file
                            
                                What is the ASCII Code of ½?
                            
                                Recursive erase hidden files
                            
                                How to get the file diff between two S3 buckets?
                            
                                Command not found in Bash's IF-ELSE condition when using [! -d "$DIR"]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With