Suppose that I have two files, en.csv
and sp.csv
, each containing exactly two comma-separated records:
en.csv
:
1,dog,red,car
3,cat,white,boat
sp.csv
:
2,conejo,gris,tren
3,gato,blanco,bote
If I execute
join -t, -a 1 -a 2 -e MISSING en.csv sp.csv
the output I get is:
1,dog,red,car
2,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote
Notice that all the missing fields have been collapsed. To get a "proper" full outer join, I need to specify a format; thus
join -t, -a 1 -a 2 -e MISSING -o 0,1.2,1.3,1.4,2.2,2.3,2.4 en.csv sp.csv
yields
1,dog,red,car,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote
One drawback of this way to produce a full outer join is that one needs to explicitly specify the format of the final table, which may not be easy to do in programmatic applications (where the identity of the joined tables is known only at runtime).
Recent versions of GNU join
eliminate this shortcoming by supporting the special format auto
. Therefore, with such a version of join
the last command above could be replaced by the far more general
join -t, -a 1 -a 2 -e MISSING -o auto en.csv sp.csv
How can I achieve this same effect with versions of join
that do not support the -o auto
option?
Background and details
I have a Unix shell (zsh) script that is designed to processes several CSV flatfiles, and does so by making extensive use of GNU join
's '-o auto' option. I need to modify this script so that it can work in environments where the available join
command does not support the -o auto
option (as is the case for BSD join
as well as for older versions of GNU join
).
A typical use of this option in the script is something like:
_reccut () {
cols="1,$1"
shift
in=$1
shift
if (( $# > 0 )); then
join -t, -a 1 -a 2 -e 'MISSING' -o auto \
<( cut -d, -f $cols $in | sort -t, -k1 ) \
<( _reccut "$@" )
else
cut -d, -f $cols $in | sort -t, -k1
fi
}
I show this example to illustrate that it would be difficult to replace -o auto
with an explicit format, since the fields to include in this format are not known until runtime.
The function _reccut
above basically extracts columns from files, and joins the resulting tables along their first column. To see how _reccut
in action, imagine that, in addition to the files mentioned above, we also had the file
de.csv
2,Kaninchen,Grau,Zug
1,Hund,Rot,Auto
Then, for example, to display side-by-side column 3 of en.csv
, columns 2 and 4 of sp.csv
, and column 3 of de.csv one would run:
% _reccut 3 en.csv 2,4 sp.csv 3 de.csv | cut -d, 2-
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING
NOTE : When using join command, both the input files should be sorted on the KEY on which we are going to join the files. So, the output contains the key followed by all the matching columns from the first file file1. txt, followed by all the columns of second file file2.
The join command in UNIX is a command line utility for joining lines of two files on a common field. It can be used to join two files by selecting fields within the line and joining the files on them. The result is written to standard output.
If you want to merge data from two text files by matching a common field, you can use the Linux join command.
The join command provides us with the ability to merge two files together using a common field in each file as the link between related lines in the files. We can think of the Linux join command the same way we think of SQL joins when we want to join two or more tables in a relational database.
Here is a solution that might or might not work for your data. It approaches the problem by aligning the records within a csv file by line number, i.e. record 2
ends up on line 2
, record 3123
on line number 3123
and so on. Missing records/lines are padded with MISSING
fields, so the input files would be mangled to look like this:
en.csv
:
1,dog,red,car
2,MISSING,MISSING,MISSING
3,cat,white,boat
de.csv
:
1,Hund,Rot,Auto
2,Kaninchen,Grau,Zug
3,MISSING,MISSING,MISSING
sp.csv
:
1,MISSING,MISSING,MISSING
2,conejo,gris,tren
3,gato,blanco,bote
From there it is easy to cut out the columns of interest and just print them side-by-side using paste
.
To achieve this, we sort the input files first and then apply some stupid awk
magic:
join -o auto
does) MISSING
fields until the alignment is correct againMISSING
fields are printed until the maximum is hit.reccut.sh
:
#!/bin/bash
get_max_recnum()
{
awk -F, '{ if ($1 > max) { max = $1 } } END { print max }' "$@"
}
align_by_recnum()
{
sort -t, -k1 "$1" \
| awk -F, -v MAXREC="$2" '
NR==1 { for(x = 1; x < NF; x++) missing = missing ",MISSING" }
{
i = NR
if (NR < $1)
{
while (i < $1)
{
print i++ missing
}
NR+=i
}
}1
END { for(i++; i <= MAXREC; i++) { print i missing } }
'
}
_reccut()
{
local infiles=()
local args=( $@ )
for arg; do
infiles+=( "$2" )
shift 2
done
MAXREC="$(get_max_recnum "${infiles[@]}")" __reccut "${args[@]}"
}
__reccut()
{
local cols="$1"
local infile="$2"
shift 2
if (( $# > 0 )); then
paste -d, \
<(align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}) \
<(__reccut "$@")
else
align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}
fi
}
_reccut "$@"
$ ./reccut.sh 3 en.csv 2,4 sp.csv 3 de.csv
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With