Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does scipy.stats.nanmean give different result from numpy.nansum?

>>> import numpy as np
>>> from scipy import stats
>>> a = np.r_[1., 2., np.nan, 4., 5.]
>>> stats.nanmean(a)
2.9999999999999996
>>> np.nansum(a)/np.sum(~np.isnan(a))
3.0

I'm aware of the limitation of floating point representation. Just curious why the more clumsy expression seems to give "better" result.

like image 676
herrlich10 Avatar asked Dec 20 '22 10:12

herrlich10


1 Answers

First of all, here is scipy.nanmean() so that we know what we're comparing to:

def nanmean(x, axis=0):
    x, axis = _chk_asarray(x,axis)
    x = x.copy()
    Norig = x.shape[axis]
    factor = 1.0-np.sum(np.isnan(x),axis)*1.0/Norig

    x[np.isnan(x)] = 0
    return np.mean(x,axis)/factor

Mathematically, the two methods are equivalent. Numerically, they are different.

Your method involves a single division, and it so happens that:

  • the numerator (1. + 2. + 4. + 5.) can be represented exactly as a float; and
  • the denominator (4.) is a power of two.

This means that the result of the division is exact, 3..

stats.nanmean() involves first computing the mean of [1., 2., 0., 4., 5.], and then adjusting it to account for NaNs. As it happens, this mean (2.4) cannot be represented exactly as a float, so from this point on the computation is inexact.

I haven't given it a lot of thought, but it may be possible to construct an example where the roles would be reversed, and stats.nanmean() would give a more accurate result than the other method.

What surprises me is that stats.nanmean() doesn't simply do something like:

In [6]: np.mean(np.ma.MaskedArray(a, np.isnan(a)))
Out[6]: 3.0

This seems to me to be a superior approach to what it does currently.

like image 87
NPE Avatar answered Dec 26 '22 12:12

NPE