 

Best way in the shell to do basic statistics?

There are so many goodies that come with a modern Unix shell environment that the thing I need is almost always installed on my machine or a quick download away; the trouble is just finding it. In this case, I'm trying to find basic statistical operations.

For example, right now I'm prototyping a crawler-based app. Thanks to wget plus some other goodies, I now have a few hundred thousand files. To estimate the cost of doing this with billions of files, I'd like to get the mean and median of file sizes over a certain limit. E.g.:

% ls -l | perl -ne '@a=split(/\s+/); next if $a[4] <100; print $a[4], "\n"' > sizes
% median sizes
% mean sizes

Sure, I could code my own median and mean bits in a little bit of perl or awk. But isn't there already some noob-friendly package that does this and a lot more besides?

William Pietri asked Nov 09 '10 20:11



1 Answer

Can you install R? Then littler and its r command can help:

~/svn/littler/examples$ ls -l . | awk '!/^total/ {print $5}' 
87
1747
756
988
959
871
~/svn/littler/examples$ ls -l . | awk '!/^total/ {print $5}' | ./fsizes.r 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     87     785     915     901     981    1750 

  The decimal point is 3 digit(s) to the right of the |

  0 | 1
  0 | 89
  1 | 00
  1 | 7

~/svn/littler/examples$ cat fsizes.r 
#!/usr/bin/r -i

fsizes <- as.integer(readLines())
print(summary(fsizes))
stem(fsizes)

This is an example we had used before, hence the R function summary(), whose output includes the median and mean, followed by an ASCII-art stem plot. Generalizing this to just calling median() or mean() is of course pretty straightforward.
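
For comparison, on a machine where R cannot be installed, the same two numbers can be computed with sort and awk alone. This is only a minimal sketch, not part of littler: the function names mean and median are hypothetical helpers (not existing commands), and the input is assumed to be one number per line, either on standard input or in a file given as an argument.

```shell
#!/bin/sh
# Hypothetical helpers, not standard commands.
# Each reads one number per line from stdin or from the files given as arguments.

# mean: sum all values and divide by the count, printed with two decimals.
mean() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }' "$@"
}

# median: sort numerically, then take the middle value,
# or the average of the two middle values when the count is even.
median() {
    sort -n "$@" | awk '{ v[NR] = $1 }
        END {
            if (NR == 0) exit
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Sample data: the six file sizes from the listing above.
printf '%s\n' 87 1747 756 988 959 871 | mean     # prints 901.33
printf '%s\n' 87 1747 756 988 959 871 | median   # prints 915
```

Both results agree with the Mean and Median columns of the summary() output above. With the sizes file from the question, the calls would be mean sizes and median sizes. Unlike the R version, the median here holds every value in an awk array, which is acceptable for a few hundred thousand sizes.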

Dirk Eddelbuettel answered Sep 28 '22 23:09
