
Calculating percentages in arbitrary number of columns

Tags: awk, bsd

Given this sample input:

ID     Sample1     Sample2      Sample3
One      10          0            5
Two      3           6            8
Three    3           4            7

I needed to produce this output using AWK, where each value is replaced by its percentage of the column total:

ID    Sample1 Sample2 Sample3
One   62.50   0.00    25.00
Two   18.75   60.00   40.00
Three 18.75   40.00   35.00

This is how I solved it:

function percent(value, total) {
    return sprintf("%.2f", 100 * value / total)
}
{
    label[NR] = $1
    for (i = 2; i <= NF; ++i) {
        sum[i] += col[i][NR] = $i
    }
}
END {
    title = label[1]
    for (i = 2; i <= length(col) + 1; ++i) {
        title = title "\t" col[i][1]
    }
    print title
    for (j = 2; j <= NR; ++j) {
        line = label[j]
        for (i = 2; i <= length(col) + 1; ++i) {
            line = line "\t" percent(col[i][j], sum[i])
        }
        print line
    }
}

This works fine in GNU AWK (awk in Linux, gawk in BSD), but not in BSD AWK, where I get this error:

$ awk -f script.awk sample.txt
awk: syntax error at source line 7 source file script.awk
 context is
          sum[i] += >>>  col[i][ <<<
awk: illegal statement at source line 7 source file script.awk
awk: illegal statement at source line 7 source file script.awk

The problem seems to be the true multidimensional arrays (col[i][NR]), which are a GNU AWK extension. I'd like this script to work in BSD AWK too, so that it's more portable.

Is there a way to change this to make it work in BSD AWK?

janos asked Dec 23 '14 15:12


2 Answers

Try using the pseudo-2-dimensional form. Instead of

col[i][NR]

use

col[i,NR]

That is a 1-dimensional array whose key is the concatenated string i SUBSEP NR.
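
To make the key format concrete, here is a minimal sketch. The function name subsep_demo and the values are made up for illustration; it only uses SUBSEP, split, and the (i, j) in array test, which should behave the same in GNU and BSD AWK.

```shell
# Sketch: col[i, j] is shorthand for col[i SUBSEP j].
# subsep_demo is a hypothetical name used only for this illustration.
subsep_demo() {
    awk 'BEGIN {
        col[2, 3] = 10                       # stored under the key "2" SUBSEP "3"
        if ((2, 3) in col) print "found"     # membership test for a compound key
        for (key in col) {
            split(key, idx, SUBSEP)          # recover the two indices from the key
            print idx[1], idx[2], col[key]
        }
    }'
}
subsep_demo
```

Because the array is flat, for (key in col) visits every cell, and the row and column indices have to be split back out of the key when they are needed.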

glenn jackman answered Sep 30 '22 02:09


@glenn's answer got me on the right path. It took a bit more work though:

  • Using col[i, NR] made the column titles awkward to handle. It helped a lot to stop buffering them and to print the header line as soon as it is read.
  • length(col) + 1 was no longer usable in the final loop condition: with col[i, j], length(col) counts every stored cell rather than the columns, and the loops became infinite. As a workaround, I replaced length(col) + 1 with simply NF.
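
The infinite loop is most likely explained by a standard AWK behavior: reading a key that does not exist creates it, so the element count grows while the loop runs. The sketch below illustrates this under that assumption. The names count_demo and count are hypothetical, and the count is done with for-in rather than length(col) so it does not depend on array-length support.

```shell
# Sketch: reading a missing element creates it, so the array keeps growing.
# count_demo and count are hypothetical names used only for this illustration.
count_demo() {
    awk '
    function count(arr,    k, n) {   # portable element count via for-in
        n = 0
        for (k in arr) n++
        return n
    }
    BEGIN {
        col[2, 2] = 10
        print count(col)             # one element stored so far
        x = col[3, 2]                # merely reading a missing key creates it
        print count(col)             # the array now has two elements
        if ((4, 2) in col) x = 1     # the "in" test does not create the key
        print count(col)             # still two elements
    }'
}
count_demo
```

This prints 1, 2, and 2 on separate lines. A loop bound that is fixed before the loop starts, such as NF or a saved column count, avoids the moving target.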

Here's the final implementation, which now works in both the GNU and BSD versions of AWK:

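# Format value as a percentage of total with two decimals
function percent(value, total) {
    return sprintf("%.2f", 100 * value / total)
}
BEGIN { OFS = "\t" }
# Header row: collapse runs of spaces into tabs and print it right away
NR == 1 { gsub(/ +/, OFS); print }
# Data rows: remember the label, store each cell, and accumulate column sums
NR != 1 {
    label[NR] = $1
    for (i = 2; i <= NF; ++i) {
        sum[i] += col[i, NR] = $i
    }
}
# NF here is the field count of the last record read
END {
    for (j = 2; j <= NR; ++j) {
        line = label[j]
        for (i = 2; i <= NF; ++i) {
            line = line OFS percent(col[i, j], sum[i])
        }
        print line
    }
}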

janos answered Sep 30 '22 03:09