>>> import numpy as np
>>> from scipy import stats
>>> a = np.r_[1., 2., np.nan, 4., 5.]
>>> stats.nanmean(a)
2.9999999999999996
>>> np.nansum(a)/np.sum(~np.isnan(a))
3.0
I'm aware of the limitation of floating point representation. Just curious why the more clumsy expression seems to give "better" result.
First of all, here is the source of scipy.stats.nanmean(), so that we know what we're comparing to:
def nanmean(x, axis=0):
    x, axis = _chk_asarray(x,axis)
    x = x.copy()
    Norig = x.shape[axis]
    factor = 1.0-np.sum(np.isnan(x),axis)*1.0/Norig
    x[np.isnan(x)] = 0
    return np.mean(x,axis)/factor
Mathematically, the two methods are equivalent. Numerically, they are different.
Your method involves a single division, and it so happens that:
- the numerator (1. + 2. + 4. + 5. = 12.) can be represented exactly as a float; and
- the denominator (4.) is a power of two.
This means that the result of the division is exact: 3.0.
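A quick sanity check in the interpreter, using only the values already shown above, confirms both points:

>>> (1. + 2. + 4. + 5.) == 12.0       # the sum of the non-NaN values is exact
True
>>> (1. + 2. + 4. + 5.) / 4. == 3.0   # dividing that exact sum by a power of two is also exact
True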
stats.nanmean() involves first computing the mean of [1., 2., 0., 4., 5.], and then adjusting it to account for the NaNs. As it happens, this intermediate mean (2.4) cannot be represented exactly as a float, so from that point on the computation is inexact.
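To make that visible, here is a rough by-hand trace of those two steps, mirroring the quoted source (the names filled and factor are just for illustration); it reproduces the value from the question:

>>> filled = np.array([1., 2., 0., 4., 5.])   # a with its NaN replaced by 0
>>> factor = 1.0 - 1.0/5                      # fraction of entries that are not NaN, i.e. 0.8
>>> np.mean(filled) / factor                  # 2.4 is already rounded, so the quotient is off
2.9999999999999996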
I haven't given it a lot of thought, but it may be possible to construct an example where the roles would be reversed, and stats.nanmean() would give a more accurate result than the other method.
What surprises me is that stats.nanmean() doesn't simply do something like:
In [6]: np.mean(np.ma.MaskedArray(a, np.isnan(a)))
Out[6]: 3.0
This seems to me to be a superior approach to what it does currently.
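For instance, a minimal sketch of such a masked-array-based replacement (the name nanmean_masked is mine, and this is an illustration rather than scipy's actual code) could look like:

def nanmean_masked(x, axis=0):
    # Mask out the NaNs and let the masked mean divide the sum of the
    # remaining values by their count, so only a single division happens.
    x = np.asanyarray(x)
    return np.ma.MaskedArray(x, np.isnan(x)).mean(axis=axis)

>>> nanmean_masked(np.r_[1., 2., np.nan, 4., 5.])
3.0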