Is there a Perl statistics package that doesn't make me load the entire dataset at once?

Question

I'm looking for a statistics package for Perl (CPAN is fine) that allows me to add data incrementally instead of having to pass in an entire array of data.

Only the mean, median, stddev, max, and min are necessary, nothing too complicated.

The reason is that my dataset is far too large to fit into memory. The data lives in a MySQL database, so right now I'm querying one subset of the data at a time, computing the statistics for each subset, and combining the results from all the manageable subsets later.

If you have other ideas on how to overcome this issue, I'd be much obliged!

DVK · Accepted Answer

You cannot compute an exact stddev and median unless you either keep the whole thing in memory or run through the data twice.

UPDATE: While you cannot compute an exact stddev IN ONE PASS, there's a one-pass approximation algorithm; the link is in a comment to this answer.

The rest are completely trivial to do in 3-5 lines of Perl (no need for a module). Stddev and median can be done in 2 passes fairly trivially as well. (I just rolled out a script that did exactly what you described, but for IP reasons I'm pretty sure I'm not allowed to post it as an example for you, sorry.)

Sample code:

use strict;
use warnings;

my @files = @ARGV; # remember the input files; <> empties @ARGV as it reads
my ($min, $max);
my $sum = 0;
my $count = 0;
while (<>) {
    chomp;
    my $current_value = $_; # assume input is 1 value/line for simplicity's sake
    $sum += $current_value;
    $count++;
    $min = $current_value if (!defined $min || $min > $current_value);
    $max = $current_value if (!defined $max || $max < $current_value);
}
die "No input values\n" unless $count;
my $mean = $sum / $count;
my $sum_mean_diffs_2 = 0;

@ARGV = @files; # restore the file list so the second pass rereads the same files
while (<>) { # Second pass to compute stddev (use for median too)
    chomp;
    my $current_value = $_;
    $sum_mean_diffs_2 += ($current_value - $mean) * ($current_value - $mean);
}
# Population stddev; divide by ($count - 1) instead for the sample stddev.
my $std_dev = sqrt($sum_mean_diffs_2 / $count);
# Median is left as an exercise for the reader.

innaM · Answer

Why don't you simply ask the database for the values you are trying to compute?

MySQL provides GROUP BY (aggregate) functions, including AVG(), MIN(), MAX(), and STDDEV(). For missing functions (there is no built-in median, for example), all you need is a little SQL.

draegtun · Answer

PDL might provide a possible solution:

Have a look at this previous SO answer which shows how to get means, std dev, etc.

Here is the relevant part of the code:

use strict;
use warnings;
use PDL;

my $figs = pdl [
    [0.01, 0.01, 0.02, 0.04, 0.03],
    [0.00, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.00, 0.01, 0.05, 0.03],
];

my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );

Blue Smith · Answer

Statistics::Descriptive::Discrete allows you to do this in a manner similar to Statistics::Descriptive, but it has been optimized for large data sets that contain a limited number of distinct values. The documentation reports a memory-usage improvement of two orders of magnitude (100x), for example.
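To illustrate why this helps when values repeat a lot, here is a minimal plain-Perl sketch of the general idea: store a count per distinct value rather than every data point, then read the median off the counts. This is not the module's actual implementation; the names %count_of, add_data, and median are made up for this example, and it assumes the number of distinct values is small enough to fit in memory.

```perl
use strict;
use warnings;

my %count_of;    # distinct value => number of times it has been seen
my $n = 0;       # total number of data points added so far

# Add values incrementally; memory grows with distinct values, not with $n.
sub add_data {
    for my $x (@_) {
        $count_of{$x}++;
        $n++;
    }
    return;
}

# Median from the counts: walk the distinct values in numeric order
# until the cumulative count reaches the middle rank(s).
sub median {
    return undef unless $n;
    my ($lo_rank, $hi_rank) = (int(($n + 1) / 2), int($n / 2) + 1);    # 1-based
    my ($seen, $lo, $hi) = (0);
    for my $v (sort { $a <=> $b } keys %count_of) {
        $seen += $count_of{$v};
        $lo = $v if !defined $lo && $seen >= $lo_rank;
        if ($seen >= $hi_rank) { $hi = $v; last }
    }
    return ($lo + $hi) / 2;    # the two ranks coincide when $n is odd
}

add_data(3, 1, 3);       # data can arrive in several batches
add_data(2, 3, 1);
print "median = ", median(), "\n";
```

With continuous data where almost every value is unique, this stores nearly as much as the raw array, so it only pays off for discretized or heavily repeated values.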

mabraham · Answer

@DVK: The one-pass algorithms for calculating mean and standard deviation here http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm are not approximations, and are more numerically robust than the example you give. See references on that page.
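For reference, a minimal Perl sketch of the on-line (Welford) update described on that page, extended with min and max. The function names new_stats, add_value, and std_dev are illustrative only, and the sketch assumes each value arrives as a plain number, for instance one row at a time from a database cursor.

```perl
use strict;
use warnings;

# Running state: count, running mean, and sum of squared deviations (m2).
sub new_stats {
    return { n => 0, mean => 0, m2 => 0, min => undef, max => undef };
}

# Fold one value into the running state; nothing else is kept in memory.
sub add_value {
    my ($s, $x) = @_;
    $s->{n}++;
    my $delta = $x - $s->{mean};
    $s->{mean} += $delta / $s->{n};
    $s->{m2}   += $delta * ($x - $s->{mean});    # note: uses the updated mean
    $s->{min} = $x if !defined $s->{min} || $x < $s->{min};
    $s->{max} = $x if !defined $s->{max} || $x > $s->{max};
    return $s;
}

# Population standard deviation; divide by (n - 1) for the sample version.
sub std_dev {
    my ($s) = @_;
    return $s->{n} ? sqrt($s->{m2} / $s->{n}) : undef;
}

my $stats = new_stats();
add_value($stats, $_) for (2, 4, 4, 4, 5, 5, 7, 9);
printf "n=%d mean=%.4f stddev=%.4f min=%s max=%s\n",
    $stats->{n}, $stats->{mean}, std_dev($stats), $stats->{min}, $stats->{max};
```

This covers mean, stddev, min, and max in a single pass. The median is not handled by this update and still needs a different approach.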

Tags:

memory

perl

statistics

Han
