NumPy: calculate averages with NaNs removed

Tags: python, nan, numpy

How can I calculate mean values along an axis of a matrix while excluding NaN values from the calculation? (For R people, think na.rm = TRUE.)

Here is my [non-]working example:

    import numpy as np

    dat = np.array([[1, 2, 3],
                    [4, 5, np.nan],
                    [np.nan, 6, np.nan],
                    [np.nan, np.nan, np.nan]])
    print(dat)
    print(dat.mean(1))  # [  2.  nan  nan  nan]

With NaNs removed, my expected output would be:

array([ 2.,  4.5,  6.,  nan]) 
Asked by Mike T, Mar 30 '11 00:03




2 Answers

I think what you want is a masked array:

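    import numpy as np

    dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
    mdat = np.ma.masked_array(dat, np.isnan(dat))  # mask every NaN entry
    mm = np.mean(mdat, axis=1)                     # row means over the unmasked values only
    print(mm.filled(np.nan))                       # the desired answer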
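As a minimal sketch of the same idea, `np.ma.masked_invalid` builds the mask in one call (it also masks infinities, which may or may not be what you want). The variable names here are only illustrative.

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# masked_invalid masks every NaN (and inf) entry in a single step
mdat = np.ma.masked_invalid(dat)

# Row means are taken over unmasked entries only;
# a row with no unmasked entries comes back masked
row_means = mdat.mean(axis=1)

# Fill the masked result with NaN to get a plain ndarray back
result = row_means.filled(np.nan)
print(result)  # rows: 2.0, 4.5, 6.0, nan
```

The final `filled(np.nan)` step matters only for rows that are entirely NaN; without it the result stays a masked array.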

Edit: Combining all of the timing data

    from timeit import Timer

    setupstr = """
    import numpy as np
    # The original answer imported nanmean from scipy.stats.stats,
    # which later SciPy releases removed; NumPy's version is used here.
    from numpy import nanmean

    dat = np.random.normal(size=(1000, 1000))
    ii = np.ix_(np.random.randint(0, 99, size=50), np.random.randint(0, 99, size=50))
    dat[ii] = np.nan
    """

    method1 = """
    mdat = np.ma.masked_array(dat, np.isnan(dat))
    mm = np.mean(mdat, axis=1)
    mm.filled(np.nan)
    """

    N = 2
    t1 = Timer(method1, setupstr).timeit(N)
    t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
    t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
    t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
    t5 = Timer("nanmean(dat, axis=1)", setupstr).timeit(N)

    # Report each timing and its ratio to the masked-array method
    for t in (t1, t2, t3, t4, t5):
        print('Time: %f\tRatio: %f' % (t, t / t1))

Returns:

    Time: 0.045454  Ratio: 1.000000
    Time: 8.179479  Ratio: 179.950595
    Time: 0.060988  Ratio: 1.341755
    Time: 0.070955  Ratio: 1.561029
    Time: 0.065152  Ratio: 1.433364
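The last method timed above calls a `nanmean` function. NumPy ships its own `np.nanmean` (NumPy 1.8 and later), so a sketch of that route on the question's data could look like this. The warning filter is an optional addition: an all-NaN row triggers a RuntimeWarning, and the result for that row is still NaN.

```python
import warnings

import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The all-NaN last row raises a RuntimeWarning; silence it for this call only
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)  # rows: 2.0, 4.5, 6.0, nan
```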
Answered by JoshAdel, Oct 06 '22 02:10


If performance matters, you should use bottleneck.nanmean() instead:

http://pypi.python.org/pypi/Bottleneck
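A rough usage sketch, assuming Bottleneck is installed (for example via `pip install bottleneck`). The fallback to `np.nanmean` is an addition for illustration so the snippet still runs without the package; it is not part of Bottleneck.

```python
import warnings

import numpy as np

try:
    import bottleneck as bn  # third-party package, assumed to be installed
    nanmean = bn.nanmean
except ImportError:
    nanmean = np.nanmean     # illustrative fallback, not part of Bottleneck

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The NumPy fallback warns on the all-NaN row; ignore that here
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = nanmean(dat, axis=1)

print(row_means)  # rows: 2.0, 4.5, 6.0, nan
```

Whether it is actually faster depends on array size and dtype, so it is worth timing on your own data.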

Answered by deprecated, Oct 06 '22 00:10