Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Memory efficient statistical distribution module

I'd like to analyze some data (say, web-service response times) and get various statistical info, mainly percentiles/quantiles and presence of outstanding values.

I know about Statistics::Descriptive, however, I don't want to store all the data in memory. On the other hand, having my results off by a few % would be fine, I only care about huge differences.

So I came up with the following idea: create an array of logarithmic buckets, and count data points landing in each bucket. Having the data spread across 6 orders of magnitude and guaranteed precision of 1% still leaves me with 6 * log 10 / log 1.01 =~ 1400 buckets which is perfectly fine (36 kb of memory, given current Perl's scalar size).

Counting percentiles is simple - just add up bucket counters until $sum exceeds $percentage * $total_count.

However, before I start writing actual code, I would like to ask which memory efficient statistical modules (for Perl) and algorithms already exist.

I have found this question, and there's similar method proposed in one of the answers. Haven't found a ready-made Perl implementation, though.

This is a slightly edited version of this Perlmonks question.

like image 629
Dallaylaen Avatar asked Jun 10 '13 10:06

Dallaylaen


1 Answers

As my search has been unsuccessful so far, I've started a new module Statistics::Descriptive::LogScale. Hope it will be helpful.

It generally follows the API of Statistics::Descriptive::Full, with several minor additions (like added central and standardized moments of arbitrary powers). I also plan to take a much closer look at Statistics::Descriptive::Weighted.

#!/usr/bin/perl -w

use strict;
use Statistics::Descriptive::LogScale;

my $stat = Statistics::Descriptive::LogScale->new ();
while(<>) { 
    $stat->add_data(m/(-?\d+(?:\.\d*))/g);
};

# This can also be done in O(1) memory, precisely
printf "Average: %f +- %f\n", 
    $stat->mean, $stat->standard_deviation;

# This requires storing actual data, or approximating
foreach (0.5, 1, 5, 10, 25, 50, 75, 90, 95, 99, 99.5) {
    printf "Percentile($_): %f\n", $stat->percentile($_);
};
like image 63
Dallaylaen Avatar answered Nov 03 '22 00:11

Dallaylaen