Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

bash add/append new columns from other files

I have a name.txt file of one column, e.g.

A
B
C
D
E
F

Then I have many files, e.g. x.txt, y.txt and z.txt

x.txt has

A 1
C 3
D 2

y.txt has

A 1
B 4
E 3

z.txt has

B 2
D 2
F 1

The desirable output is (filling in 0 if there is no mapping)

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Is it possible to make it with bash? (perhaps awk?)
Many thanks!!!


first edits - my tentative efforts
Since I am quite new to bash, it is really hard for me to figure out a possible solution with awk. I'm more familiar with R, in which this can be accomplished by

namematrix[namematrix[,1]==xmatrix[,1],]

All in all, I really appreciate the kind help below helping me learn more about awk and join!


Second-time edits - a super efficient approach figured out!

Luckily inspired by some really brilliant answers below, I have sorted out a very computationally efficient way as below. This may be helpful to other people encountering similar questions, in particular if they deal with very large number of files with very large size.

Firstly touch a join_awk.bash

#!/bin/bash
join -oauto -e0 -a1 $1 $2 | awk '{print $2}'

For instance, execute this bash script for name.txt and x.txt

join_awk.bash name.txt x.txt

would generate

1
0
3
2
0
0

Note that here I only keep the second column to save disk space, because in my dataset the first columns are very long names that would take tremendous disk space.

Then simply implement

parallel join_awk.bash name.txt {} \> outdir/output.{} ::: {a,b,c}.txt

This is inspired by the brilliant answer below using GNU parallel and join. The difference is that the answer below has to specify j1 for parallel due to its serial appending logic, which makes it not really "parallel". Also, the speed will become slower and slower as the serial appending continues. In contrast, here we manipulate each file separately in parallel. It can be extremely fast when we deal with large number of large-size files with multiple CPUs.

Finally just merge all the single-column output files together by

cd outdir
paste output* > merged.txt

This will be also very fast since paste is inherently parallel.

like image 473
Elfxy Avatar asked Nov 24 '20 05:11

Elfxy


People also ask

How do I add a column from one file to another in Unix?

For reference, paste -d' ' file1 file2 will paste file2 to file 1 (as an appended column) with a space as a delimiter; paste -d'\t' file1 file2 will paste with a tab as a delimiter. E.g. paste -d'\t' file1. tsv file2.

How do I append in bash?

To make a new file in Bash, you normally use > for redirection, but to append to an existing file, you would use >> . Take a look at the examples below to see how it works. To append some text to the end of a file, you can use echo and redirect the output to be appended to a file.

How do you append in terminal?

Use the echo command, used with the append redirection operator, to add a single line of text to a file. This adds the message Remember to back up mail files by end of week. to the end of the file notes.


1 Answers

You may use this awk:

awk 'NF == 2 {
   map[FILENAME,$1] = $2
   next
}
{
   printf "%s", $1
   for (f=1; f<ARGC-1; ++f)
      printf "%s", OFS map[ARGV[f],$1]+0
   print ""
}' {x,y,z}.txt name.txt
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1
like image 166
anubhava Avatar answered Sep 24 '22 12:09

anubhava