
Is there a Perl statistics package that doesn't make me load the entire dataset at once?

I'm looking for a statistics package for Perl (CPAN is fine) that allows me to add data incrementally instead of having to pass in an entire array of data.

Just the mean, median, stddev, max, and min are necessary, nothing too complicated.

The reason is that my dataset is far too large to fit into memory. The data is stored in a MySQL database, so right now I'm just querying a subset of the data, computing the statistics for that subset, and then combining the results from all the manageable subsets later.

If you have other ideas on how to overcome this issue, I'd be much obliged!

Han asked Sep 10 '09 15:09


5 Answers

You cannot do an exact stddev and a median unless you either keep the whole thing in memory or run through the data twice.

UPDATE: While you cannot do an exact stddev IN ONE PASS, there is a one-pass approximation algorithm; the link is in a comment on this answer.

The rest are completely trivial (no need for a module) to do in 3-5 lines of Perl. Stddev and median can be done in 2 passes fairly trivially as well. (I just wrote a script that did exactly what you described, but for IP reasons I'm pretty sure I'm not allowed to post it as an example for you, sorry.)

Sample code:

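use strict;
use warnings;

my $file = shift @ARGV or die "Usage: $0 datafile\n"; # assume 1 value/line for simplicity's sake
my ($min, $max);
my $sum = 0;
my $count = 0;

open my $fh, '<', $file or die "Cannot open $file: $!";
while (my $current_value = <$fh>) { # First pass: count, sum, min, max
    chomp $current_value;
    $sum += $current_value;
    $count++;
    $min = $current_value if (!defined $min || $min > $current_value);
    $max = $current_value if (!defined $max || $max < $current_value);
}
die "No data in $file\n" unless $count;
my $mean = $sum / $count;
my $sum_mean_diffs = 0;

# Rewind the file; a second while (<>) loop would find the input already consumed.
seek $fh, 0, 0 or die "Cannot rewind $file: $!";
while (my $current_value = <$fh>) { # Second pass to compute stddev (use for median too)
    chomp $current_value;
    $sum_mean_diffs += ($current_value - $mean) * ($current_value - $mean);
}
close $fh;
my $std_dev = sqrt($sum_mean_diffs / $count); # population standard deviation
# Median is left as an exercise for the reader.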

DVK answered Nov 16 '22 02:11


Why don't you simply ask the database for the values you are trying to compute?

Among other things, MySQL offers GROUP BY (aggregate) functions. For anything that is missing, all you need is a little SQL.
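
As a rough illustration of letting the database do the work, here is a minimal Perl sketch using DBI. The table name `measurements`, the column name `value`, and the connection details are hypothetical placeholders, not something from the question. COUNT(), AVG(), STDDEV_POP(), MIN() and MAX() are standard MySQL aggregate functions; MySQL has no built-in median aggregate, so the median would need extra SQL or a separate pass.

```perl
use strict;
use warnings;

# Build one aggregate query for a numeric column.
# $table and $column are hypothetical names; they are interpolated
# directly, so they must come from trusted code, never from user input.
sub stats_sql {
    my ($table, $column) = @_;
    return "SELECT COUNT($column), AVG($column), STDDEV_POP($column), "
         . "MIN($column), MAX($column) FROM $table";
}

# Run the query through DBI (needs the DBI and DBD::mysql modules).
# $dsn is a placeholder such as "DBI:mysql:database=mydb;host=localhost".
sub fetch_stats {
    my ($dsn, $user, $password, $table, $column) = @_;
    require DBI;
    my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });
    my ($count, $mean, $std_dev, $min, $max) =
        $dbh->selectrow_array(stats_sql($table, $column));
    $dbh->disconnect;
    return { count => $count, mean => $mean, std_dev => $std_dev,
             min => $min, max => $max };
}
```

Calling `fetch_stats($dsn, $user, $password, 'measurements', 'value')` would return a hash reference holding the five values, and only a single row ever travels from the server to the Perl process.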

innaM answered Nov 16 '22 02:11


PDL might provide a possible solution:

Have a look at this previous SO answer which shows how to get means, std dev, etc.

Here is the relevant part of the code:

use strict;
use warnings;
use PDL;

my $figs = pdl [
    [0.01, 0.01, 0.02, 0.04, 0.03],
    [0.00, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.00, 0.01, 0.05, 0.03],
];

my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );

draegtun answered Nov 16 '22 01:11


Statistics::Descriptive::Discrete lets you do this with an interface similar to that of Statistics::Descriptive, but it has been optimized for use with large data sets. For example, its documentation reports a reduction in memory usage of two orders of magnitude (100x).
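
As I understand the module's documentation, the saving comes from storing each distinct value once together with a count, which only helps when the data has relatively few distinct values. The following is a module-free sketch of that idea, not the module's actual code; the names `add_value` and `median` are made up for illustration. It also gives an exact median without holding every row in memory.

```perl
use strict;
use warnings;

# Frequency-table accumulator: each distinct value is stored once with a count.
# Memory grows with the number of distinct values, not the number of rows.
my %count_of;   # numeric value => number of occurrences
my $n = 0;      # total number of data points seen so far

sub add_value {
    my ($value) = @_;
    $count_of{ $value + 0 }++;   # "+ 0" normalizes "0.10" and "0.1" to one key
    $n++;
}

# Exact median: walk the distinct values in numeric order until the
# cumulative count reaches the middle position(s).
sub median {
    return undef unless $n;
    my $lower_rank = int(($n + 1) / 2);   # 1-based rank of the lower middle
    my $upper_rank = int($n / 2) + 1;     # 1-based rank of the upper middle
    my ($seen, $lower) = (0, undef);
    for my $value (sort { $a <=> $b } keys %count_of) {
        $seen += $count_of{$value};
        $lower = $value if !defined $lower && $seen >= $lower_rank;
        return ($lower + $value) / 2 if $seen >= $upper_rank;
    }
}
```

Each row fetched from the database would be passed to `add_value`, and `median()` can be called at any point. If nearly every value is unique, the hash ends up as large as the data set and this approach no longer helps.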

Blue Smith answered Nov 16 '22 00:11


@DVK: The one-pass algorithms for calculating mean and standard deviation described at http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm are not approximations, and are more numerically robust than the example you give. See the references on that page.
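
For illustration, here is a minimal Perl sketch of the online (Welford) update described on that page, plus the pairwise combination formula from the same page for merging statistics computed on separate subsets. The function names `new_stats`, `update`, `std_dev` and `merge_stats` are made up for this example.

```perl
use strict;
use warnings;

# Accumulator for Welford's online algorithm. m2 is the running sum of
# squared deviations from the current mean.
sub new_stats {
    return { count => 0, mean => 0, m2 => 0, min => undef, max => undef };
}

# Fold one value into the accumulator.
sub update {
    my ($s, $x) = @_;
    $s->{count}++;
    my $delta = $x - $s->{mean};
    $s->{mean} += $delta / $s->{count};
    $s->{m2}   += $delta * ($x - $s->{mean});   # uses the updated mean
    $s->{min} = $x if !defined $s->{min} || $x < $s->{min};
    $s->{max} = $x if !defined $s->{max} || $x > $s->{max};
}

# Population standard deviation (divide by count - 1 for the sample version).
sub std_dev {
    my ($s) = @_;
    return undef unless $s->{count};
    return sqrt($s->{m2} / $s->{count});
}

# Combine two accumulators built from disjoint subsets of the data.
sub merge_stats {
    my ($s, $t) = @_;
    return +{ %$t } unless $s->{count};
    return +{ %$s } unless $t->{count};
    my $n     = $s->{count} + $t->{count};
    my $delta = $t->{mean} - $s->{mean};
    return {
        count => $n,
        mean  => $s->{mean} + $delta * $t->{count} / $n,
        m2    => $s->{m2} + $t->{m2}
                 + $delta**2 * $s->{count} * $t->{count} / $n,
        min   => ($s->{min} < $t->{min} ? $s->{min} : $t->{min}),
        max   => ($s->{max} > $t->{max} ? $s->{max} : $t->{max}),
    };
}
```

Each row fetched from the database can be passed to `update` as it arrives, so only a handful of numbers are kept in memory. `merge_stats` covers the case in the question where subsets are processed separately and combined later. The median is not covered by this approach.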

mabraham answered Nov 16 '22 01:11