Suppose that I have two files, <code>en.csv</code> and <code>sp.csv</code>, each containing exactly two comma-separated records: <code>en.csv</code>: <pre class="prettyprint"><code>1,dog,red,car 3,cat,white,boat </code></pre> <code>sp.csv</code>: <pre class="prettyprint"><code>2,conejo,gris,tren 3,gato,blanco,bote </code></pre> If I execute <pre class="prettyprint"><code>join -t, -a 1 -a 2 -e MISSING en.csv sp.csv </code></pre> the output I get is: <pre class="prettyprint"><code>1,dog,red,car 2,conejo,gris,tren 3,cat,white,boat,gato,blanco,bote </code></pre> Notice that all the missing fields have been collapsed. To get a "proper" full outer join, I need to specify a format; thus <pre class="prettyprint"><code>join -t, -a 1 -a 2 -e MISSING -o 0,1.2,1.3,1.4,2.2,2.3,2.4 en.csv sp.csv </code></pre> yields <pre class="prettyprint"><code>1,dog,red,car,MISSING,MISSING,MISSING 2,MISSING,MISSING,MISSING,conejo,gris,tren 3,cat,white,boat,gato,blanco,bote </code></pre> One drawback of this way to produce a full outer join is that one needs to explicitly specify the format of the final table, which may not be easy to do in programmatic applications (where the identity of the joined tables is known only at runtime). Recent versions of GNU <code>join</code> eliminate this shortcoming by supporting the special format <code>auto</code>. Therefore, with such a version of <code>join</code> the last command above could be replaced by the far more general <pre class="prettyprint"><code>join -t, -a 1 -a 2 -e MISSING -o auto en.csv sp.csv </code></pre> How can I achieve this same effect with versions of <code>join</code> that do not support the <code>-o auto</code> option? <hr> Background and details I have a Unix shell (zsh) script that is designed to processes several CSV flatfiles, and does so by making extensive use of GNU <code>join</code>'s '-o auto' option. I need to modify this script so that it can work in environments where the available <code>join</code> command does not support the <code>-o auto</code> option (as is the case for BSD <code>join</code> as well as for older versions of GNU <code>join</code>). A typical use of this option in the script is something like: <pre class="prettyprint"><code>_reccut () { cols="1,$1" shift in=$1 shift if (( $# > 0 )); then join -t, -a 1 -a 2 -e 'MISSING' -o auto \ <( cut -d, -f $cols $in | sort -t, -k1 ) \ <( _reccut "$@" ) else cut -d, -f $cols $in | sort -t, -k1 fi } </code></pre> I show this example to illustrate that it would be difficult to replace <code>-o auto</code> with an explicit format, since the fields to include in this format are not known until runtime. The function <code>_reccut</code> above basically extracts columns from files, and joins the resulting tables along their first column. To see how <code>_reccut</code> in action, imagine that, in addition to the files mentioned above, we also had the file <code>de.csv</code> <pre class="prettyprint"><code>2,Kaninchen,Grau,Zug 1,Hund,Rot,Auto </code></pre> Then, for example, to display side-by-side column 3 of <code>en.csv</code>, columns 2 and 4 of <code>sp.csv</code>, and column 3 of de.csv one would run: <pre class="prettyprint"><code>% _reccut 3 en.csv 2,4 sp.csv 3 de.csv | cut -d, 2- red,MISSING,MISSING,Rot MISSING,conejo,tren,Grau white,gato,bote,MISSING </code></pre>

Here is a solution that might or might not work for your data. It approaches the problem by aligning the records within a csv file by line number, i.e. record <code>2</code> ends up on line <code>2</code>, record <code>3123</code> on line number <code>3123</code> and so on. Missing records/lines are padded with <code>MISSING</code> fields, so the input files would be mangled to look like this: <code>en.csv</code>: <pre class="prettyprint"><code>1,dog,red,car 2,MISSING,MISSING,MISSING 3,cat,white,boat </code></pre> <code>de.csv</code>: <pre class="prettyprint"><code>1,Hund,Rot,Auto 2,Kaninchen,Grau,Zug 3,MISSING,MISSING,MISSING </code></pre> <code>sp.csv</code>: <pre class="prettyprint"><code>1,MISSING,MISSING,MISSING 2,conejo,gris,tren 3,gato,blanco,bote </code></pre> From there it is easy to cut out the columns of interest and just print them side-by-side using <code>paste</code>. To achieve this, we sort the input files first and then apply some stupid <code>awk</code> magic: <ul> <li>If a record appears on their expected line number, print it</li> <li>Otherwise, print as many lines containing the number of expected (this is based on the number of fields of the first line in the file, same as what <code>join -o auto</code> does) <code>MISSING</code> fields until the alignment is correct again</li> <li>Not all input files are going to the same number of records, so the maximum is searched for before all of this. Then, more lines with <code>MISSING</code> fields are printed until the maximum is hit.</li> </ul> <h3>Code</h3> <code>reccut.sh</code>: <pre class="prettyprint"><code>#!/bin/bash get_max_recnum() { awk -F, '{ if ($1 > max) { max = $1 } } END { print max }' "$@" } align_by_recnum() { sort -t, -k1 "$1" \ | awk -F, -v MAXREC="$2" ' NR==1 { for(x = 1; x < NF; x++) missing = missing ",MISSING" } { i = NR if (NR < $1) { while (i < $1) { print i++ missing } NR+=i } }1 END { for(i++; i <= MAXREC; i++) { print i missing } } ' } _reccut() { local infiles=() local args=( $@ ) for arg; do infiles+=( "$2" ) shift 2 done MAXREC="$(get_max_recnum "${infiles[@]}")" __reccut "${args[@]}" } __reccut() { local cols="$1" local infile="$2" shift 2 if (( $# > 0 )); then paste -d, \ <(align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}) \ <(__reccut "$@") else align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols} fi } _reccut "$@" </code></pre> <h3>Run</h3> <pre class="prettyprint"><code>$ ./reccut.sh 3 en.csv 2,4 sp.csv 3 de.csv red,MISSING,MISSING,Rot MISSING,conejo,tren,Grau white,gato,bote,MISSING </code></pre>

How to get all fields in outer join with Unix join?

Tags:

shell

join

unix

gnu

gnu-coreutils

Suppose that I have two files, en.csv and sp.csv, each containing exactly two comma-separated records:

en.csv:

1,dog,red,car
3,cat,white,boat

sp.csv:

2,conejo,gris,tren
3,gato,blanco,bote

If I execute

join -t, -a 1 -a 2 -e MISSING en.csv sp.csv

the output I get is:

1,dog,red,car
2,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote

Notice that all the missing fields have been collapsed. To get a "proper" full outer join, I need to specify a format; thus

join -t, -a 1 -a 2 -e MISSING -o 0,1.2,1.3,1.4,2.2,2.3,2.4 en.csv sp.csv

yields

1,dog,red,car,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,conejo,gris,tren
3,cat,white,boat,gato,blanco,bote

One drawback of this way to produce a full outer join is that one needs to explicitly specify the format of the final table, which may not be easy to do in programmatic applications (where the identity of the joined tables is known only at runtime).

Recent versions of GNU join eliminate this shortcoming by supporting the special format auto. Therefore, with such a version of join the last command above could be replaced by the far more general

join -t, -a 1 -a 2 -e MISSING -o auto en.csv sp.csv

How can I achieve this same effect with versions of join that do not support the -o auto option?

Background and details

I have a Unix shell (zsh) script that is designed to processes several CSV flatfiles, and does so by making extensive use of GNU join's '-o auto' option. I need to modify this script so that it can work in environments where the available join command does not support the -o auto option (as is the case for BSD join as well as for older versions of GNU join).

A typical use of this option in the script is something like:

_reccut () {
    cols="1,$1"
    shift
    in=$1
    shift
    if (( $# > 0 )); then
        join -t, -a 1 -a 2 -e 'MISSING' -o auto \
          <( cut -d, -f $cols $in | sort -t, -k1 ) \
          <( _reccut "$@" )
    else
        cut -d, -f $cols $in | sort -t, -k1
    fi
}

I show this example to illustrate that it would be difficult to replace -o auto with an explicit format, since the fields to include in this format are not known until runtime.

The function _reccut above basically extracts columns from files, and joins the resulting tables along their first column. To see how _reccut in action, imagine that, in addition to the files mentioned above, we also had the file

de.csv

2,Kaninchen,Grau,Zug
1,Hund,Rot,Auto

Then, for example, to display side-by-side column 3 of en.csv, columns 2 and 4 of sp.csv, and column 3 of de.csv one would run:

% _reccut 3 en.csv 2,4 sp.csv 3 de.csv | cut -d, 2-
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING

259

asked Mar 02 '13 20:03

kjo

1 Answers

Here is a solution that might or might not work for your data. It approaches the problem by aligning the records within a csv file by line number, i.e. record 2 ends up on line 2, record 3123 on line number 3123 and so on. Missing records/lines are padded with MISSING fields, so the input files would be mangled to look like this:

en.csv:

1,dog,red,car
2,MISSING,MISSING,MISSING
3,cat,white,boat

de.csv:

1,Hund,Rot,Auto
2,Kaninchen,Grau,Zug
3,MISSING,MISSING,MISSING

sp.csv:

1,MISSING,MISSING,MISSING
2,conejo,gris,tren
3,gato,blanco,bote

From there it is easy to cut out the columns of interest and just print them side-by-side using paste.

To achieve this, we sort the input files first and then apply some stupid awk magic:

If a record appears on their expected line number, print it
Otherwise, print as many lines containing the number of expected (this is based on the number of fields of the first line in the file, same as what join -o auto does) MISSING fields until the alignment is correct again
Not all input files are going to the same number of records, so the maximum is searched for before all of this. Then, more lines with MISSING fields are printed until the maximum is hit.

Code

reccut.sh:

#!/bin/bash

get_max_recnum()
{
    awk -F, '{ if ($1 > max) { max = $1 } } END { print max }' "$@"
}

align_by_recnum()
{
    sort -t, -k1 "$1" \
        | awk -F, -v MAXREC="$2" '
            NR==1 { for(x = 1; x < NF; x++) missing = missing ",MISSING" }
            {
                i = NR
                if (NR < $1)
                {
                    while (i < $1)
                    {
                        print i++ missing
                    }
                    NR+=i
                }
            }1
            END { for(i++; i <= MAXREC; i++) { print i missing } }
            '
}

_reccut()
{
    local infiles=()
    local args=( $@ )
    for arg; do
        infiles+=( "$2" )
        shift 2
    done
    MAXREC="$(get_max_recnum "${infiles[@]}")" __reccut "${args[@]}"
}

__reccut()
{
    local cols="$1"
    local infile="$2"
    shift 2

    if (( $# > 0 )); then
        paste -d, \
            <(align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}) \
            <(__reccut "$@")
    else
        align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}
    fi
}

_reccut "$@"

Run

$ ./reccut.sh 3 en.csv 2,4 sp.csv 3 de.csv
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING

166

answered Sep 19 '22 09:09

Adrian Frühwirth

Related questions
                            
                                upload zip file to google drive using curl
                            
                                what language uses "fi"
                            
                                Unix convert Month name to number
                            
                                How to pass password from file to mysql command?
                            
                                Output whole line once for each unique value of a column (Bash)
                            
                                How to swap parameters in Shell command
                            
                                List files that are in directory1 but NOT in directory2 and vice versa?
                            
                                cut: can we set multiple spaces as the delimiter?
                            
                                Can't Connect to Mlab
                            
                                Is it possible to make a bash shell script interact with another command line program?
                            
                                Calling a python script from command line without typing "python" first
                            
                                Why doesn't this bash expr command work? [duplicate]
                            
                                Sender and receiver to transfer files over ssh on request?
                            
                                How to run multiple Unix commands in one time?
                            
                                wget and run/remove bash script in one line
                            
                                How to make powershell tell me about missing DLLs?
                            
                                Why does posh fail to perform pathname expansion when a part of the path is specified within double-quotes?
                            
                                launching background process in capistrano task

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With