I have a name.txt file of one column, e.g.
A
B
C
D
E
F
Then I have many files, e.g. x.txt, y.txt and z.txt
x.txt has
A 1
C 3
D 2
y.txt has
A 1
B 4
E 3
z.txt has
B 2
D 2
F 1
The desirable output is (filling in 0 if there is no mapping)
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1
Is it possible to make it with bash? (perhaps awk?)
Many thanks!!!
first edits - my tentative efforts
Since I am quite new to bash, it is really hard for me to figure out a possible solution with awk. I'm more familiar with R, in which this can be accomplished by
namematrix[namematrix[,1]==xmatrix[,1],]
All in all, I really appreciate the kind help below helping me learn more about awk
and join
!
Second-time edits - a super efficient approach figured out!
Luckily inspired by some really brilliant answers below, I have sorted out a very computationally efficient way as below. This may be helpful to other people encountering similar questions, in particular if they deal with very large number of files with very large size.
Firstly touch a join_awk.bash
#!/bin/bash
join -oauto -e0 -a1 $1 $2 | awk '{print $2}'
For instance, execute this bash script for name.txt and x.txt
join_awk.bash name.txt x.txt
would generate
1
0
3
2
0
0
Note that here I only keep the second column to save disk space, because in my dataset the first columns are very long names that would take tremendous disk space.
Then simply implement
parallel join_awk.bash name.txt {} \> outdir/output.{} ::: {a,b,c}.txt
This is inspired by the brilliant answer below using GNU parallel and join. The difference is that the answer below has to specify j1
for parallel
due to its serial appending logic, which makes it not really "parallel". Also, the speed will become slower and slower as the serial appending continues. In contrast, here we manipulate each file separately in parallel. It can be extremely fast when we deal with large number of large-size files with multiple CPUs.
Finally just merge all the single-column output files together by
cd outdir
paste output* > merged.txt
This will be also very fast since paste
is inherently parallel.
For reference, paste -d' ' file1 file2 will paste file2 to file 1 (as an appended column) with a space as a delimiter; paste -d'\t' file1 file2 will paste with a tab as a delimiter. E.g. paste -d'\t' file1. tsv file2.
To make a new file in Bash, you normally use > for redirection, but to append to an existing file, you would use >> . Take a look at the examples below to see how it works. To append some text to the end of a file, you can use echo and redirect the output to be appended to a file.
Use the echo command, used with the append redirection operator, to add a single line of text to a file. This adds the message Remember to back up mail files by end of week. to the end of the file notes.
You may use this awk
:
awk 'NF == 2 {
map[FILENAME,$1] = $2
next
}
{
printf "%s", $1
for (f=1; f<ARGC-1; ++f)
printf "%s", OFS map[ARGV[f],$1]+0
print ""
}' {x,y,z}.txt name.txt
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With