How can I calculate mean values along an axis of a matrix, but exclude NaN
values from the calculation? (For R people, think na.rm = TRUE.)
Here is my [non-]working example:
    import numpy as np

    dat = np.array([[1, 2, 3],
                    [4, 5, np.nan],
                    [np.nan, 6, np.nan],
                    [np.nan, np.nan, np.nan]])
    print(dat)
    print(dat.mean(1))  # [ 2. nan nan nan]
With NaNs removed, my expected output would be:
array([ 2., 4.5, 6., nan])
np.nanmean returns NaN for slices that contain only NaNs. The arithmetic mean is the sum of the non-NaN elements along the axis divided by the number of non-NaN elements. Note that for floating-point input, the mean is computed using the same precision the input has.
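The behaviour described above is that of NumPy's np.nanmean (available in NumPy 1.8 and later). A minimal sketch on the array from the question, assuming a reasonably recent NumPy; the variable name row_means is only illustrative:

```python
import warnings

import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# np.nanmean emits a RuntimeWarning for the all-NaN row; silence it here
# so that only the result is shown.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

# Row means are 2.0, 4.5, 6.0 and nan (the last row has no valid values).
print(row_means)
```

The all-NaN row still yields nan, which matches the expected output in the question.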
I think what you want is a masked array:
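    import numpy as np

    dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
    mdat = np.ma.masked_array(dat, np.isnan(dat))  # mask every NaN entry
    mm = np.mean(mdat, axis=1)                     # mean over the unmasked values only
    print(mm.filled(np.nan))  # the desired answer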
Edit: Combining all of the timing data
    from timeit import Timer

    # Note: scipy.stats.stats.nanmean was deprecated and later removed from SciPy;
    # on newer installations np.nanmean is the usual replacement.
    setupstr = """
    import numpy as np
    from scipy.stats.stats import nanmean
    dat = np.random.normal(size=(1000,1000))
    ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
    dat[ii] = np.nan
    """

    method1 = """
    mdat = np.ma.masked_array(dat,np.isnan(dat))
    mm = np.mean(mdat,axis=1)
    mm.filled(np.nan)
    """

    N = 2
    t1 = Timer(method1, setupstr).timeit(N)
    t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
    t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
    t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
    t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)

    # Report each timing and its ratio relative to the masked-array method (t1).
    for t in (t1, t2, t3, t4, t5):
        print('Time: %f\tRatio: %f' % (t, t / t1))
Returns:
    Time: 0.045454    Ratio: 1.000000
    Time: 8.179479    Ratio: 179.950595
    Time: 0.060988    Ratio: 1.341755
    Time: 0.070955    Ratio: 1.561029
    Time: 0.065152    Ratio: 1.433364
If performance matters, you should use bottleneck.nanmean() instead:
http://pypi.python.org/pypi/Bottleneck
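As a rough sketch of how that might look, assuming the third-party Bottleneck package is installed and imported under the conventional alias bn. The fallback to np.nanmean is my own addition so the snippet still runs without Bottleneck; it is not part of the original suggestion:

```python
import warnings

import numpy as np

try:
    import bottleneck as bn  # third-party package, assumed to be installed
    nanmean = bn.nanmean
except ImportError:
    # Fallback (an assumption for this sketch): NumPy's own NaN-aware mean.
    nanmean = np.nanmean

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# The NumPy fallback warns about the all-NaN row; silence that warning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    fast_means = nanmean(dat, axis=1)

# Row means are 2.0, 4.5, 6.0 and nan.
print(fast_means)
```

Whether this is actually faster depends on array size and dtype, so it is worth timing on your own data.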