How can I use AWK to compute the median of a column of numerical data?
I can think of a simple algorithm but I can't seem to program it:
What I have so far is:
sort | awk 'END{print NR}'
And this gives me the number of elements in the column. I'd like to use this to print a certain row (NR/2). If NR/2 is not an integer, then I round up to the nearest integer and that is the median, otherwise I take the average of (NR/2)+1 and (NR/2)-1.
Add the numbers in $2 (the second column) to sum (awk auto-initializes variables to zero) and increment the number of rows (which could also be handled via the built-in variable NR). At the end, if at least one value was read, print the average.
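The averaging steps described above can be sketched roughly as follows. The function name average_col2 and the sample rows are made up for illustration; only the use of $2, a running sum, a row counter, and an END block come from the description.

```shell
# Hypothetical helper: average of the second whitespace-separated column.
average_col2() {
    awk '{ sum += $2; n++ }                 # sum and n start out as zero
         END { if (n > 0) print sum / n }   # print nothing when no rows were read
        ' "$@"
}

# Example: three rows whose second column holds 10, 20 and 30.
printf 'a 10\nb 20\nc 30\n' | average_col2   # prints 20
```

Note that this computes the mean, not the median; the median needs the sorted values kept in an array, as the other answers show.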
With awk you have to store the values in an array and compute the median at the end. Assuming we look at the first column:
sort -n file | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
The one-liner above returns the upper of the two middle values when the number of rows is even. For a true median, average the two middle values in that case:
sort -n file | awk ' { a[i++]=$1; }
END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
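As a sketch of how the same idea might be packaged for reuse, the function below takes the column number as an argument and prints nothing on empty input. The name median_col, the default of column 1, and the rule of skipping rows where the column is empty are assumptions for illustration, not part of the answer above.

```shell
# Hypothetical helper: median of column $1 (default: 1), read from stdin.
median_col() {
    col="${1:-1}"
    # Keep only rows where the chosen column is non-empty, then sort numerically.
    awk -v c="$col" '$c != "" { print $c }' |
    sort -n |
    awk '{ a[NR] = $1 }
         END {
             if (NR == 0) { exit }                # no data: print nothing
             if (NR % 2) {
                 print a[(NR + 1) / 2]            # odd count: the middle value
             } else {
                 print (a[NR / 2] + a[NR / 2 + 1]) / 2   # even count: mean of the two middle values
             }
         }'
}

printf '3\n1\n2\n'      | median_col     # prints 2
printf '4\n1\n3\n2\n'   | median_col     # prints 2.5
printf 'x 10\ny 30\n'   | median_col 2   # prints 20
```

Sorting happens inside the function, so the caller does not need to pre-sort the input.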
This awk program assumes one column of numerically sorted data:
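#!/usr/bin/awk -f
# Adjust the interpreter path above if awk lives elsewhere on your system.
# Store every value, indexed by its (1-based) row number.
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        # Odd number of rows: the middle row is the median.
        print count[(NR + 1) / 2];
    } else {
        # Even number of rows: average the two middle rows.
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}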
Sample usage:
sort -n data_file | awk -f median.awk