Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

compute mean in python for a generator

I'm doing some statistics work, I have a (large) collection of random numbers to compute the mean of, I'd like to work with generators, because I just need to compute the mean, so I don't need to store the numbers.

The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this?

It would be nice if I could say "sum(values)/len(values)", but len doesn't work for genetators, and sum already consumed values.

here's an example:

import numpy 

def my_mean(values):
    n = 0
    Sum = 0.0
    try:
        while True:
            Sum += next(values)
            n += 1
    except StopIteration: pass
    return float(Sum)/n

X = [k for k in range(1,7)]
Y = (k for k in range(1,7))

print numpy.mean(X)
print my_mean(Y)

these both give the same, correct, answer, buy my_mean doesn't work for lists, and numpy.mean doesn't work for generators.

I really like the idea of working with generators, but details like this seem to spoil things.

like image 471
nick maxwell Avatar asked Feb 10 '11 23:02

nick maxwell


4 Answers

In general if you're doing a streaming mean calculation of floating point numbers, you're probably better off using a more numerically stable algorithm than simply summing the generator and dividing by the length.

The simplest of these (that I know) is usually credited to Knuth, and also calculates variance. The link contains a python implementation, but just the mean portion is copied here for completeness.

def mean(data):
    n = 0
    mean = 0.0
 
    for x in data:
        n += 1
        mean += (x - mean)/n

    if n < 1:
        return float('nan')
    else:
        return mean

I know this question is super old, but it's still the first hit on google, so it seemed appropriate to post. I'm still sad that the python standard library doesn't contain this simple piece of code.

like image 77
Erik Avatar answered Sep 19 '22 04:09

Erik


Just one simple change to your code would let you use both. Generators were meant to be used interchangeably to lists in a for-loop.

def my_mean(values):
    n = 0
    Sum = 0.0
    for v in values:
        Sum += v
        n += 1
    return Sum / n
like image 39
Mark Ransom Avatar answered Sep 20 '22 04:09

Mark Ransom


def my_mean(values):
    total = 0
    for n, v in enumerate(values, 1):
        total += v
    return total / n

print my_mean(X)
print my_mean(Y)

There is statistics.mean() in Python 3.4 but it calls list() on the input:

def mean(data):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    return _sum(data)/n

where _sum() returns an accurate sum (math.fsum()-like function that in addition to float also supports Fraction, Decimal).

like image 29
jfs Avatar answered Sep 22 '22 04:09

jfs


The old-fashioned way to do it:

def my_mean(values):
   sum, n = 0, 0
   for x in values:
      sum += x
      n += 1
   return float(sum)/n
like image 43
Jimmy Avatar answered Sep 22 '22 04:09

Jimmy