Here's my input file:
1.37987
1.21448
0.624999
1.28966
1.77084
1.088
1.41667
I would like to create bins of a size of my choice to get histogram-like output, e.g. something like this for bins of width 0.1, starting from 0:
0 0.1 0
...
0.5 0.6 0
0.6 0.7 1
...
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
...
My file is too big for R, so I'm looking for an awk solution (also open to anything else that I can understand, as I'm still a Linux beginner).
A similar question was already answered in the post "awk histogram in buckets", but that solution is not working for me.
This should be very close, if not exactly right. Treat it as a starting point and verify the math yourself. In particular, decide which bucket an exact boundary value such as 0.2 should go into (0.1 to 0.2, or 0.2 to 0.3?) and check that the script does that:
$ cat tst.awk
BEGIN { delta = (delta == "" ? 0.1 : delta) }   # default bin width is 0.1
{
    # 1-based bucket index: values in [0, delta) go to bucket 1, and so on
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    # beg starts out unset, which awk treats as 0
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}
$ awk -f tst.awk file
0.0 0.1 0
0.1 0.2 0
0.2 0.3 0
0.3 0.4 0
0.4 0.5 0
0.5 0.6 0
0.6 0.7 1
0.7 0.8 0
0.8 0.9 0
0.9 1.0 0
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
1.4 1.5 1
1.5 1.6 0
1.6 1.7 0
1.7 1.8 1
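To check where an exact boundary value ends up, one option is to print the bucket number that the formula from tst.awk assigns to a few test values. This is only a sketch: `bucket_of` is a hypothetical helper name, not part of the script above. Since 0.1 has no exact binary representation, a value that looks like a boundary can land on either side of it, so run this against your own awk and the boundaries you care about instead of assuming the result.

```shell
# bucket_of DELTA: read one number per line on stdin and print "value bucketNr".
# Hypothetical helper; it uses the same bucket formula as tst.awk.
bucket_of() {
    awk -v delta="$1" '{ print $0, int(($0 + delta) / delta) }'
}

# 0.25 sits mid-bucket; 0.2 and 0.7 look like exact boundaries
printf '%s\n' 0.25 0.2 0.7 | bucket_of 0.1
```

Bucket number n covers the range from (n-1)*delta to n*delta, so a bucket number of 3 with delta 0.1 means the 0.2 to 0.3 row.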
Note that you can set the bucket size delta on the command line; 0.1 is just the default value:
$ awk -v delta='0.3' -f tst.awk file
0.0 0.3 0
0.3 0.6 0
0.6 0.9 1
0.9 1.2 1
1.2 1.5 4
1.5 1.8 1
$ awk -v delta='0.5' -f tst.awk file
0.0 0.5 0
0.5 1.0 1
1.0 1.5 5
1.5 2.0 1
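The script prints bucket edges with a fixed `%0.1f` format and builds them by repeated addition, which suits widths like 0.1 or 0.5 but would not show edges such as 0.25 correctly. A possible variant, offered as a sketch, not a drop-in replacement: it derives each edge from the bucket index and prints it with `%g`. The wrapper name `hist` is hypothetical, and the code assumes non-negative input with one number per line.

```shell
# hist DELTA: read one number per line on stdin, print "begin end count" per bin.
# Hypothetical wrapper around a variant of tst.awk; assumes non-negative input.
hist() {
    awk -v delta="${1:-0.1}" '
    {
        n = int(($0 + delta) / delta)      # same bucket formula as tst.awk
        cnt[n]++
        if (n > max) max = n
    }
    END {
        for (n = 1; n <= max; n++) {
            # edges derived from the index; %g adapts to the bin width
            printf "%g %g %d\n", (n - 1) * delta, n * delta, cnt[n]
        }
    }'
}

# first three sample values, with a bin width of 0.25
printf '%s\n' 1.37987 1.21448 0.624999 | hist 0.25
```

For a real file the call would be `hist 0.25 < file`. awk reads the input one line at a time and keeps only one counter per bucket in memory, so the size of the file should not be a problem.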