bash add/append new columns from other files

I have a name.txt file of one column, e.g.

A
B
C
D
E
F

Then I have many files, e.g. x.txt, y.txt and z.txt

x.txt has

A 1
C 3
D 2

y.txt has

A 1
B 4
E 3

z.txt has

B 2
D 2
F 1

The desired output is (filling in 0 where there is no mapping)

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Is it possible to do this with bash (perhaps awk)?
Many thanks!!!

First edit - my tentative efforts
Since I am quite new to bash, I found it hard to work out a solution with awk. I'm more familiar with R, in which this can be accomplished by

namematrix[namematrix[,1]==xmatrix[,1],]

All in all, I really appreciate the kind help below, which taught me more about awk and join!

Second edit - a very efficient approach

Inspired by some really brilliant answers below, I have worked out a computationally efficient approach. It may help other people with similar problems, in particular if they deal with a very large number of very large files.

First, create a script join_awk.bash

#!/bin/bash
# Usage: join_awk.bash NAMES FILE (join expects both inputs sorted on the first field)
# -a1 keeps every name, -e0 fills missing values with 0; -o auto is a GNU extension
join -oauto -e0 -a1 "$1" "$2" | awk '{print $2}'

For instance, running this script on name.txt and x.txt

join_awk.bash name.txt x.txt

would generate

1
0
3
2
0
0

Note that I keep only the second column to save disk space, because in my dataset the first column holds very long names that would take up a great deal of space.
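One caveat with the script above: join expects both inputs to be sorted on the join field, and the sample files just happen to be. Below is a minimal sketch of sorting first, assuming a shell with mktemp and using the sample data from the question (x.txt is shuffled on purpose). It also swaps in the output list -o 2.2, a variant that prints only the value column so that the awk step is not needed. The temporary file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: sort both inputs before joining; sample data from the question.
LC_ALL=C; export LC_ALL          # make sort and join use the same collation
work=$(mktemp -d)
printf '%s\n' A B C D E F > "$work/name.txt"
printf '%s\n' 'D 2' 'A 1' 'C 3' > "$work/x.txt"   # deliberately unsorted

sort "$work/name.txt" > "$work/name.sorted"
sort -k1,1 "$work/x.txt" > "$work/x.sorted"

# -a 1 keeps unmatched names, -o 2.2 prints only the value from the data file,
# and -e 0 replaces that value with 0 when the name has no match.
column=$(join -a 1 -e 0 -o 2.2 "$work/name.sorted" "$work/x.sorted")

printf '%s\n' "$column"
rm -rf "$work"
```

If name.txt is not already sorted, sorting it changes the row order, so every per-file column should be joined against the same sorted name list.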

Then simply run (outdir must already exist)

parallel join_awk.bash name.txt {} \> outdir/output.{} ::: {a,b,c}.txt

This is inspired by the brilliant answer below using GNU parallel and join. The difference is that the answer below has to specify -j1 for parallel because it appends columns serially, which makes it not really "parallel", and it gets slower and slower as the appended file grows. In contrast, here each file is processed separately and in parallel, which can be extremely fast when there are many large files and multiple CPUs.
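The per-file independence described above does not depend on GNU parallel itself. As a rough sketch, assuming bash, GNU join (for -o auto) and the sample data from the question, the same idea can be shown with plain background jobs. Here join_column is a hypothetical function standing in for join_awk.bash, and the temporary directory layout is made up for illustration.

```shell
#!/bin/bash
# Sketch: one independent join per input file, run concurrently with shell
# background jobs instead of GNU parallel. Sample data is from the question.
export LC_ALL=C
work=$(mktemp -d)
mkdir "$work/outdir"
printf '%s\n' A B C D E F > "$work/name.txt"
printf '%s\n' 'A 1' 'C 3' 'D 2' > "$work/x.txt"
printf '%s\n' 'A 1' 'B 4' 'E 3' > "$work/y.txt"
printf '%s\n' 'B 2' 'D 2' 'F 1' > "$work/z.txt"

# join_column is a hypothetical stand-in for join_awk.bash.
join_column() {
    join -o auto -e 0 -a 1 "$1" "$2" | awk '{print $2}'
}

for f in x y z; do
    # Each job reads name.txt plus one data file and writes its own output
    # file, so no job has to wait for another.
    join_column "$work/name.txt" "$work/$f.txt" > "$work/outdir/output.$f.txt" &
done
wait   # block until every background job has finished

col_y=$(cat "$work/outdir/output.y.txt")
printf '%s\n' "$col_y"
rm -rf "$work"
```

Unlike parallel, this loop starts one job per file with no upper limit, so for thousands of files a tool that caps the number of simultaneous jobs is the better fit.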

Finally, merge all the single-column output files with

cd outdir
paste output* > merged.txt

This is also fast, since paste streams through all the files in a single pass, reading one line from each in turn.
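Because only the value columns are kept, the merged file has no names in it. A small sketch of putting them back, assuming the per-file outputs follow the output.* naming used above and contain the sample values from the question (the temporary directory is made up for illustration):

```shell
#!/bin/sh
# Sketch: put the names back as the first column (sample data from the question).
work=$(mktemp -d)
printf '%s\n' A B C D E F > "$work/name.txt"
printf '%s\n' 1 0 3 2 0 0 > "$work/output.x.txt"
printf '%s\n' 1 4 0 0 3 0 > "$work/output.y.txt"
printf '%s\n' 0 2 0 2 0 1 > "$work/output.z.txt"

# The glob expands in sorted order, so the columns appear as x, y, z.
# -d ' ' joins the columns with a space instead of the default tab.
table=$(paste -d ' ' "$work/name.txt" "$work"/output.*)

printf '%s\n' "$table"
rm -rf "$work"
```

The column order is whatever order the shell expands the glob in, so it is worth checking that against the intended file order before relying on the result.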

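You may use this awk:

awk 'NF == 2 {                      # data files (x.txt, y.txt, z.txt) have two fields
   map[FILENAME,$1] = $2            # remember the value for each (file, name) pair
   next
}
{                                   # name.txt, read last, has one field per line
   printf "%s", $1
   for (f=1; f<ARGC-1; ++f)         # one column per data file, in argument order
      printf "%s", OFS map[ARGV[f],$1]+0   # +0 turns a missing entry into 0
   print ""
}' {x,y,z}.txt name.txt

which prints

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1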
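The command above tells the data files from the names file by field count (NF == 2), which assumes names contain no whitespace and data lines always have exactly two fields. As a hedged variant, not the answer's own code, the names file can instead be recognised by its position as the last argument. The sketch below assumes the sample data from the question, written to a temporary directory made up for illustration.

```shell
#!/bin/sh
# Sketch: same table as the answer, but the names file is recognised by its
# position (last argument) instead of by its number of fields.
work=$(mktemp -d)
printf '%s\n' A B C D E F > "$work/name.txt"
printf '%s\n' 'A 1' 'C 3' 'D 2' > "$work/x.txt"
printf '%s\n' 'A 1' 'B 4' 'E 3' > "$work/y.txt"
printf '%s\n' 'B 2' 'D 2' 'F 1' > "$work/z.txt"

table=$(awk '
FILENAME != ARGV[ARGC-1] {          # any file except the last one is a data file
   map[FILENAME,$1] = $2
   next
}
{                                   # last file: the list of names
   printf "%s", $1
   for (f = 1; f < ARGC-1; ++f)
      printf "%s%s", OFS, map[ARGV[f],$1] + 0
   print ""
}' "$work/x.txt" "$work/y.txt" "$work/z.txt" "$work/name.txt")

printf '%s\n' "$table"
rm -rf "$work"
```

Either way, all data files are held in memory before the names are read, so this suits inputs that fit in RAM.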

Tags: awk, gnu-parallel

Asked by Elfxy. Answered by anubhava.
