How can I do it?
File1 looks like this:
foo 1 scaf 3
bar 2 scaf 3.3
File2 looks like this:
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
What I want to do is find the lines that occur in both File1 and File2 when fields 1, 2, and 3 are the same.
Is there a way to do it?
NOTE: When using the join command, both input files must be sorted on the key used for joining. The output contains the key, followed by the remaining columns of the first file (file1.txt), followed by the remaining columns of the second file (file2).
To join two files with the join command, the files must share a common join field. The default join field is the first field, delimited by blanks.
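To make the two points above concrete, here is a minimal sketch of a plain join on the default (first) field. The file names left.txt and right.txt and their contents are made up for illustration; they are not the files from the question.

```shell
# Hypothetical sample files, created in a temporary directory.
tmp=$(mktemp -d)
printf 'b 2\na 1\n' > "$tmp/left.txt"
printf 'a x\nc z\nb y\n' > "$tmp/right.txt"

# join expects both inputs sorted on the join field (field 1 by default).
# LC_ALL=C keeps sort and join on the same collation order.
LC_ALL=C sort -k1,1 "$tmp/left.txt"  > "$tmp/left.sorted"
LC_ALL=C sort -k1,1 "$tmp/right.txt" > "$tmp/right.sorted"

# Only keys present in both files are printed: key, then the rest of each line.
simple_join=$(LC_ALL=C join "$tmp/left.sorted" "$tmp/right.sorted")
printf '%s\n' "$simple_join"
# a 1 x
# b 2 y

rm -r "$tmp"
```

The key c exists only in right.txt, so it does not appear in the output.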
Here is an answer that relies on standard tools (GNU coreutils join and sort, plus an awk one-liner) instead of a custom perl/awk script.
$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
OK, how does it work:
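First of all, we use a great tool, join, which merges lines from two files that share a key. join has two requirements: each input needs a single key field to join on, and both inputs must be sorted by that key.
We need to generate the keys in the input files, and for that we use a simple awk script: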
$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3
$ <file1 awk '{print $1"-"$2"-"$3" "$0}'
foo-1-scaf foo 1 scaf 3
bar-2-scaf bar 2 scaf 3.3
As you can see, we prepended a first column containing a key such as "foo-1-scaf".
We do the same with file2.
BTW, <file awk '...' is just a fancy way of writing awk '...' file, or cat file | awk '...'.
We should also sort our files by the key, which in our case is column 1, so we append | sort -k1,1 to the command (sort by the text from column 1 to column 1).
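As a quick check of the key-plus-sort step, this sketch runs it on the File2 data from the question. The temporary directory and the variable name keyed_sorted are just for illustration, and LC_ALL=C is an added assumption to make the sort order predictable.

```shell
tmp=$(mktemp -d)
# File2 from the question.
printf 'foo 1 scaf 4.5\nfoo 1 boo 2.3\nbar 2 scaf 1.00\n' > "$tmp/file2"

# Prepend the key built from fields 1-3, then sort on that new first column.
keyed_sorted=$(awk '{print $1"-"$2"-"$3" "$0}' "$tmp/file2" | LC_ALL=C sort -k1,1)
printf '%s\n' "$keyed_sorted"
# bar-2-scaf bar 2 scaf 1.00
# foo-1-boo foo 1 boo 2.3
# foo-1-scaf foo 1 scaf 4.5

rm -r "$tmp"
```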
At this point we could just generate the files file1.with.key and file2.with.key and join them, but if those files are huge we don't want to write extra copies to the filesystem. Instead we can use bash process substitution, which hands each command's output to join through a named pipe (or a /dev/fd entry), so no intermediate files are created. For more info, see the section on process substitution in the bash manual.
Our target syntax is: join <( some command ) <(some other command)
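For comparison, the intermediate-file variant that process substitution lets us skip could look roughly like this. It is only a sketch: the helper name add_key and the temporary directory are hypothetical, and LC_ALL=C is an added assumption so that sort and join agree on ordering.

```shell
tmp=$(mktemp -d)
# File1 and File2 from the question.
printf 'foo 1 scaf 3\nbar 2 scaf 3.3\n' > "$tmp/file1"
printf 'foo 1 scaf 4.5\nfoo 1 boo 2.3\nbar 2 scaf 1.00\n' > "$tmp/file2"

# Hypothetical helper: prepend the key column and sort by it.
add_key() { awk '{print $1"-"$2"-"$3" "$0}' "$1" | LC_ALL=C sort -k1,1; }

# Write the keyed copies to disk (this is the step process substitution avoids).
add_key "$tmp/file1" > "$tmp/file1.with.key"
add_key "$tmp/file2" > "$tmp/file2.with.key"

joined=$(LC_ALL=C join -j 1 -o 1.2,1.3,1.4,1.5,2.5 \
  "$tmp/file1.with.key" "$tmp/file2.with.key")
printf '%s\n' "$joined"
# bar 2 scaf 3.3 1.00
# foo 1 scaf 3 4.5

rm -r "$tmp"
```

The result is the same as with the one-liner; the only difference is that the keyed copies exist on disk for a moment.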
The last thing is to explain the fancy join arguments: -j1 -o1.2,1.3,1.4,1.5,2.5
-j1 - join by the key in the 1st column (in both files)
-o - output only the listed fields: 1.2 (1st file, field 2), 1.3 (1st file, field 3), etc.
This way we join the lines, but join outputs only the necessary columns.
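To see what -o buys us, this sketch runs join on already keyed and sorted inputs, once without -o and once with it. The file names k1 and k2 are hypothetical stand-ins for the keyed versions of File1 and File2.

```shell
tmp=$(mktemp -d)
# Keyed, sorted inputs, as produced by the awk | sort step.
printf 'bar-2-scaf bar 2 scaf 3.3\nfoo-1-scaf foo 1 scaf 3\n' > "$tmp/k1"
printf 'bar-2-scaf bar 2 scaf 1.00\nfoo-1-boo foo 1 boo 2.3\nfoo-1-scaf foo 1 scaf 4.5\n' > "$tmp/k2"

# Without -o: the key, then every remaining field of both lines.
default_out=$(LC_ALL=C join -j 1 "$tmp/k1" "$tmp/k2")
printf '%s\n' "$default_out"
# bar-2-scaf bar 2 scaf 3.3 bar 2 scaf 1.00
# foo-1-scaf foo 1 scaf 3 foo 1 scaf 4.5

# With -o: only fields 2-5 of the first file and field 5 of the second.
selected_out=$(LC_ALL=C join -j 1 -o 1.2,1.3,1.4,1.5,2.5 "$tmp/k1" "$tmp/k2")
printf '%s\n' "$selected_out"
# bar 2 scaf 3.3 1.00
# foo 1 scaf 3 4.5

rm -r "$tmp"
```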
The lessons learned from this post should be: