Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

bash: clean outer join of three files, preserving file-membership

Consider the following three files with headers in the first row:

file1:

id name in1
1 jon 1
2 sue 1

file2:

id name in2
2 sue 1
3 bob 1

file3:

id name in3
2 sue 1
3 adam 1

I want to merge these files to get the following output, merged_files:

id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1

This request has several special features that I have not found implemented in a handy way in grep/sed/awk/join etc. Edit: You may assume, for simplicity, that the three files have already been sorted.

like image 293
zkurtz Avatar asked Jan 19 '26 10:01

zkurtz


2 Answers

This is very similar to the problem solved in Bash script to find matching rows from multiple CSV files. It's not identical, but it is very similar. (So similar that I only had to remove three sort commands, change the three sed commands slightly, change the file names, change the 'missing' value from no to 0, and change the replacement in the final sed from comma to space.)

The join command with sed (usually sort too, but the data is already sufficiently sorted) are the primary tools needed. Assume that : does not appear in the original data. To record the presence of a row in a file, we want a 1 field in the file (it's almost there); we'll have join supply the 0 when there isn't a match. The 1 at the end of each non-heading line needs to become :1, and the last field in the heading also needs to be preceded by the :. Then, using bash's process substitution, we can write:

$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2     - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$

The sed command (three times) adds the : before the last field in each line of the files. The joins are very nearly symmetric. The -t: specifies that the field separator is the colon; the -a 1 and -a 2 mean that when there isn't a match in a file, the line will still be included in the output; the -e 0 means that if there isn't a match in a file, a 0 is generated in the output; and the -o option specifies the output columns. For the first join, -o 0,1.2,2.2 the output is the join column (0), then the second column (the 1) from the two files. The second join has 3 columns in the input, so it specifies -o 0,1.2,1.3,2.2. The argument - on its own means 'read standard input'. The <(...) notation is 'process substitution', where a file name (usually /dev/fd/NN) is provided to the join command, and it contains the output of the command inside the parentheses. The output is then filtered through sed once more to replace the colons with spaces, yielding the desired output.

The only difference from the desired output is the sequencing of 3 bob after 3 adam; it is not particularly clear on what basis you ordered them in reverse in your desired output. If it is crucial, a means can be devised for resolving the order differently (such as sort -k1,1 -k3,5, except that sorts the label line after the data; there are workarounds for that if necessary).

like image 158
Jonathan Leffler Avatar answered Jan 21 '26 07:01

Jonathan Leffler


Code for GNU awk:

{
if ($1=="id") { v[i++]=$3; next }
b[$1,$2]=$1" "$2
c[i-1,$1" "$2]=$3
}

END {
printf ("id name")
for (x in v) printf (" %s", v[x]); printf ("\n")
for (y in b)  {
    printf ("%s", b[y])
    for (z in v) if (c[z,b[y]]==0) {printf (" 0")} else printf (" %s", c[z,b[y]])
    printf ("\n")
    }
}
$cat file?
id name in1
1 jon 1
2 sue 1
id name in2
2 sue 1
3 bob 1
id name in3
2 sue 1
3 adam 1

$awk -f prog.awk file?
id name in1 in2 in3
3 bob 0 1 0
3 adam 0 0 1
1 jon 1 0 0
2 sue 1 1 1
like image 25
captcha Avatar answered Jan 21 '26 07:01

captcha