I have this code:
awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]} ' inputfile
and I would like to extend it to collapse rows with more than two fields, always using the first field as the index.
Input file (three column tab-delimited):
protein_1   membrane    1e-4
protein_1   intracellular   1e-5
protein_2   membrane    1e-50
protein_2   citosol 1e-40
Desired output (three column tab-delimited):
protein_1   membrane, intracellular 1e-4, 1e-5
protein_2   membrane, citosol   1e-50, 1e-40
Thanks!
I'm stuck here:
awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]"\t" : "\t") $2};{a[$1]=(a[$1] ? a[$1]", " : "\t") $3} END{for (i in a) print i a[i]} ' 1 inputfile
With GNU awk for 2-D arrays:
$ gawk '
{ a[$1][$2] = $3 }
END {
    for (i in a) {
        printf "%s", i
        sep = "\t"
        for (j in a[i]) {
            printf "%s%s", sep, j
            sep = ", "
        }
        sep = "\t"
        for (j in a[i]) {
            printf "%s%s", sep, a[i][j]
            sep = ", "
        }
        print ""
    }
}' file
protein_1       membrane, intracellular 1e-4, 1e-5
protein_2       membrane, citosol       1e-50, 1e-40
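Two caveats with the approach above: the order in which `for (i in a)` visits keys is unspecified in awk, so rows and joined values may not come out in input order, and arrays of arrays need gawk. For reference, here is a minimal portable sketch that should work in any POSIX awk and keeps first-seen order. The function name `collapse_by_key` and the array names are made up for illustration, and it assumes tab-delimited input with exactly three columns. Unlike the `!seen[$1,$2]++` code in the question, it does not drop duplicate rows, so columns 2 and 3 stay aligned.

```shell
# collapse_by_key: hypothetical helper name, not from the original thread.
# Assumes tab-delimited input with three columns; reads files given as
# arguments, or stdin when called without arguments.
collapse_by_key() {
    awk -F'\t' '
    {
        if (!($1 in col2)) {              # first time this key appears
            order[++n] = $1               # remember first-seen order
            col2[$1] = $2
            col3[$1] = $3
        } else {                          # later rows: append with ", "
            col2[$1] = col2[$1] ", " $2
            col3[$1] = col3[$1] ", " $3
        }
    }
    END {
        # Walk keys by insertion index instead of "for (k in ...)",
        # whose iteration order is unspecified.
        for (k = 1; k <= n; k++)
            print order[k] "\t" col2[order[k]] "\t" col3[order[k]]
    }' "$@"
}

# Sample data from the question, written with real tabs.
printf 'protein_1\tmembrane\t1e-4\nprotein_1\tintracellular\t1e-5\nprotein_2\tmembrane\t1e-50\nprotein_2\tcitosol\t1e-40\n' |
    collapse_by_key
```

To run it on a file instead of stdin, call `collapse_by_key inputfile`.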
perl -lane'
  $ar = $h{shift @F} ||= [];
  push @{$ar->[$_]}, $F[$_] for 0,1;
  END {
    $" = ", ";
    print "$_\t@{$h{$_}[0]}\t@{$h{$_}[1]}" for sort keys %h;
  }
' file
Output:
protein_1 membrane, intracellular 1e-4, 1e-5
protein_2 membrane, citosol 1e-50, 1e-40