There are so many goodies that come with a modern Unix shell environment that the thing I need is almost always installed on my machine or a quick download away; the trouble is just finding it. In this case, I'm trying to find basic statistical operations.
For example, right now I'm prototyping a crawler-based app. Thanks to wget plus some other goodies, I now have a few hundred thousand files. To estimate the cost of doing this with billions of files, I'd like to get the mean and median of the file sizes above a certain limit. E.g.:
% ls -l | perl -ne '@a=split(/\s+/); next if $a[4] <100; print $a[4], "\n"' > sizes
% median sizes
% mean sizes
Sure, I could code my own median and mean bits in a little bit of perl or awk. But isn't there already some noob-friendly package that does this and a lot more besides?
Can you install R? Then littler and its r command can help:
~/svn/littler/examples$ ls -l . | awk '!/^total/ {print $5}'
87
1747
756
988
959
871
~/svn/littler/examples$ ls -l . | awk '!/^total/ {print $5}' | ./fsizes.r
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     87     785     915     901     981    1750

  The decimal point is 3 digit(s) to the right of the |

  0 | 1
  0 | 89
  1 | 00
  1 | 7
~/svn/littler/examples$ cat fsizes.r
#!/usr/bin/r -i
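# Read one file size per line from standard input
fsizes <- as.integer(readLines())
# Five-number summary plus the mean
print(summary(fsizes))
# Text-mode stem-and-leaf plot
stem(fsizes)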
This is an example we had used before, hence the R function summary(), which reports the median and mean among other statistics, together with an ASCII-art stem plot from stem(). Generalizing to just calling median() or mean() is of course pretty straightforward.
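If installing R is not an option, the two commands the question wishes for could be approximated with sort and awk alone. The following is only a minimal sketch, not an existing package: the function names mean and median are borrowed from the question's hypothetical commands, and it assumes the input file holds one number per line, like the sizes file built above.

```shell
# Hypothetical helpers named after the commands wished for in the question.
# Assumption: the file given as $1 contains one number per line.

# Arithmetic mean: sum the first field and divide by the line count.
mean() {
    awk '{ s += $1; n++ } END { if (n) print s / n }' "$1"
}

# Median: sort numerically, keep every value, then pick the middle one
# (or the average of the two middle values for an even count).
median() {
    sort -n "$1" | awk '{ v[NR] = $1 }
        END {
            if (NR == 0) exit
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}
```

After defining these in the current shell, median sizes and mean sizes behave like the commands in the question. For the six sizes listed above, median gives 915 and mean gives roughly 901.3, which agrees with the summary() output. Note that this median keeps every value in memory inside awk, so for very large inputs R or a sampling approach may be the better choice.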