Joining multiple fields in text files on Unix

Tags: linux, bash, join, unix

How can I do it?

File1 looks like this:

foo 1 scaf 3 
bar 2 scaf 3.3

File2 looks like this:

foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00

What I want to do is find the lines that occur in both File1 and File2, where fields 1, 2, and 3 are the same.

Is there a way to do it?

asked Apr 12 '10 02:04

neversaint

1 Answer

Here is an answer that sticks to standard command-line tools (GNU coreutils plus a one-line awk) instead of a custom script in Perl, awk, you name it.

$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5

OK, how does it work:

First of all, we use join, a great tool that merges lines from two files when they share a common key field. join has two requirements:
- It can join on only a single field per file.
- Both files must be sorted on the key column!
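
To make the two requirements concrete, here is a minimal sketch of a plain single-field join. The file names and their contents are made up for illustration, and `LC_ALL=C` is my own addition so that sort and join agree on the ordering.

```bash
#!/usr/bin/env bash
# Sketch: join on a single, pre-sorted key field (hypothetical sample data).
export LC_ALL=C                      # same collation for sort and join
tmp=$(mktemp -d)

printf '%s\n' 'b 2' 'a 1'       > "$tmp/left.txt"
printf '%s\n' 'a x' 'c z' 'b y' > "$tmp/right.txt"

# join expects both inputs sorted on the join field, so sort them first.
sort -k1,1 "$tmp/left.txt"  > "$tmp/left.sorted"
sort -k1,1 "$tmp/right.txt" > "$tmp/right.sorted"

# The join field defaults to field 1 in both files.
# Lines without a partner (here "c z") are dropped.
result=$(join "$tmp/left.sorted" "$tmp/right.sorted")
printf '%s\n' "$result"
# a 1 x
# b 2 y

rm -r "$tmp"
```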

We need to generate keys in the input files, and for that we use a simple awk script:
```
$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3 

$ <file1 awk '{print $1"-"$2"-"$3" "$0}'
foo-1-scaf foo 1 scaf 3
bar-2-scaf bar 2 scaf 3.3
```
You see, we added a 1st column holding a key such as "foo-1-scaf". We do the same with file2. BTW, `<file awk` is just a fancy way of writing `awk file` or `cat file | awk`.
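
As a quick check of that last remark, the sketch below runs the same key-building awk program in all three spellings. The temporary file and the variable names (`prog`, `via_redirect`, and so on) are only illustrative.

```bash
#!/usr/bin/env bash
# Sketch: three equivalent ways to feed a file to the same awk program.
tmp=$(mktemp -d)
printf '%s\n' 'foo 1 scaf 3' 'bar 2 scaf 3.3' > "$tmp/file1"

# Prefix every line with a key built from fields 1-3.
prog='{print $1"-"$2"-"$3" "$0}'

via_redirect=$(<"$tmp/file1" awk "$prog")     # redirection written first
via_argument=$(awk "$prog" "$tmp/file1")      # file passed as an argument
via_cat=$(cat "$tmp/file1" | awk "$prog")     # file piped in through cat

printf '%s\n' "$via_redirect"
# foo-1-scaf foo 1 scaf 3
# bar-2-scaf bar 2 scaf 3.3

rm -r "$tmp"
```

One thing to keep in mind with this kind of key: if the data itself can contain "-", two different field triples could produce the same key, so a separator that never occurs in the data is the safer pick.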

We also need to sort our files by the key, which in our case is column 1, so we append `| sort -k1,1` to the command (sort on the text from column 1 through column 1).

At this point we could just generate the files file1.with.key and file2.with.key and join them, but suppose those files are huge; we don't want to write extra copies of them to the filesystem. Instead we can use bash process substitution, which exposes a command's output through a named pipe or a /dev/fd entry, so no unnecessary intermediate file is created. For more info, see the section on process substitution in the Bash manual.

Our target syntax is: `join <(some command) <(some other command)`
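
To compare the two approaches side by side, here is a rough sketch that runs the join once through intermediate files and once through process substitution. `add_key` is a hypothetical helper I introduce only for this example, `LC_ALL=C` is my own addition, and the script needs bash because plain sh has no `<( )`.

```bash
#!/usr/bin/env bash
# Sketch: intermediate files versus process substitution (bash only).
export LC_ALL=C                      # same collation for sort and join
tmp=$(mktemp -d)
printf '%s\n' 'foo 1 scaf 3' 'bar 2 scaf 3.3'                    > "$tmp/file1"
printf '%s\n' 'foo 1 scaf 4.5' 'foo 1 boo 2.3' 'bar 2 scaf 1.00' > "$tmp/file2"

# Hypothetical helper: prepend a key built from fields 1-3, then sort by it.
add_key() { awk '{print $1"-"$2"-"$3" "$0}' "$1" | sort -k1,1; }

# Variant 1: write the keyed files to disk, then join them.
add_key "$tmp/file1" > "$tmp/file1.with.key"
add_key "$tmp/file2" > "$tmp/file2.with.key"
with_files=$(join "$tmp/file1.with.key" "$tmp/file2.with.key")

# Variant 2: let bash hand join two pipes instead of two files.
with_procsub=$(join <(add_key "$tmp/file1") <(add_key "$tmp/file2"))

printf '%s\n' "$with_procsub"
rm -r "$tmp"
```

Both variants should print the same two joined lines; the second one just never stores the keyed copies.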

The last thing to explain is the fancy join arguments: `-j1 -o1.2,1.3,1.4,1.5,2.5`
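- `-j1` - join on the key in the 1st column of both files (field 1 is also join's default).
- `-o` - output only the listed fields: `1.2` (1st file, field 2), `1.3` (1st file, field 3), etc. The last entry, `2.5`, is field 5 of the 2nd file.
  
  This way the lines are joined, but join prints only the necessary columns and leaves out the helper key.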
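
To see what `-o` changes, the sketch below joins two small keyed inputs with and without the field list. The keyed lines are written out by hand (already sorted) to keep the example short, and `-j 1` is spelled with a space, which GNU join treats the same as `-1 1 -2 1`.

```bash
#!/usr/bin/env bash
# Sketch: default join output versus an explicit -o field list.
export LC_ALL=C
tmp=$(mktemp -d)

# Keyed and already sorted: field 1 is the helper key, fields 2-5 the original line.
printf '%s\n' 'bar-2-scaf bar 2 scaf 3.3' \
              'foo-1-scaf foo 1 scaf 3'    > "$tmp/a"
printf '%s\n' 'bar-2-scaf bar 2 scaf 1.00' \
              'foo-1-boo foo 1 boo 2.3'    \
              'foo-1-scaf foo 1 scaf 4.5'  > "$tmp/b"

# Without -o: the key, then the rest of line 1, then the rest of line 2.
default_out=$(join -j 1 "$tmp/a" "$tmp/b")

# With -o: fields 2-5 of the 1st file, then field 5 of the 2nd file.
selected=$(join -j 1 -o 1.2,1.3,1.4,1.5,2.5 "$tmp/a" "$tmp/b")

printf '%s\n' "$selected"
# bar 2 scaf 3.3 1.00
# foo 1 scaf 3 4.5

rm -r "$tmp"
```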

The lessons learned from this post should be:

- You should master the coreutils package. Those tools are very powerful when combined, and you almost never need to write a custom program to deal with cases like this.
- The coreutils tools are also fast and heavily tested, so they are often the best choice.

answered Sep 18 '22 23:09

thedk
