I've got a numpy array filled mostly with real numbers, but there are a few nan values in it as well.
How can I replace each nan with the average of the column it is in?
In NumPy, to replace missing values (NaN, np.nan) in an ndarray with other numbers, use np.nan_to_num() or np.isnan().
However, np.average doesn't ignore NaN like np.
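As a rough illustration of the two replacement routes and of how NaN propagates through an ordinary average, here is a small sketch. The array x and the replacement values 0.0 and -1.0 are made up for the example, and np.nanmean stands in as the NaN-ignoring function. The nan= keyword of np.nan_to_num needs NumPy 1.17 or later.

```python
import numpy as np

# Hypothetical 1-D example data with one missing value.
x = np.array([1.0, np.nan, 3.0])

# Route 1: np.nan_to_num returns a copy with NaN replaced by a constant.
replaced_constant = np.nan_to_num(x, nan=0.0)

# Route 2: build a boolean mask with np.isnan and assign through it.
replaced_masked = x.copy()
replaced_masked[np.isnan(replaced_masked)] = -1.0

# A plain average propagates NaN, while np.nanmean skips it.
plain_average = np.average(x)
nan_aware_mean = np.nanmean(x)
```

Both routes fill in a constant; to fill in a per-column statistic you still need to compute that statistic with a NaN-aware function first.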
No loops required:
print(a)
[[ 0.93230948         nan  0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785         nan]
 [ 0.64940216  0.74414127         nan         nan]]

# Obtain mean of columns as you need, nanmean is convenient.
col_mean = np.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219  0.7030395   0.44528687  0.66640474]

# Find indices that you need to replace
inds = np.where(np.isnan(a))

# Place column means in the indices. Align the arrays using take
a[inds] = np.take(col_mean, inds[1])
print(a)
[[ 0.93230948  0.7030395   0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785  0.66640474]
 [ 0.64940216  0.74414127  0.44528687  0.66640474]]
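If you'd rather not modify the array in place, the same steps can be wrapped in a small function. This is only a sketch: the name fill_nan_with_column_mean and the example array are made up here, and indexing col_mean[cols] is used in place of np.take (the two are equivalent for this purpose).

```python
import numpy as np

def fill_nan_with_column_mean(arr):
    """Return a copy of arr with each NaN replaced by its column's mean."""
    out = np.array(arr, dtype=float)      # np.array copies, so arr is untouched
    col_mean = np.nanmean(out, axis=0)    # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(out))  # positions of the NaN entries
    out[rows, cols] = col_mean[cols]      # pick the mean of each NaN's column
    return out

# Hypothetical example data: column means are 4, 7 and 6.
example = np.array([[1.0, np.nan, 3.0],
                    [4.0, 5.0, np.nan],
                    [7.0, 9.0, 9.0]])
filled = fill_nan_with_column_mean(example)
```

Like the snippet above, this assumes every column has at least one non-nan value.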
The standard way to do this using only numpy would be to use the masked array module.
Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.
Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...
Suppose you have an array a:
>>> a
array([[  0.,  nan,  10.,  nan],
       [  1.,   6.,  nan,  nan],
       [  2.,   7.,  12.,  nan],
       [  3.,   8.,  nan,  nan],
       [ nan,   9.,  14.,  nan]])
>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)
array([[  0. ,   7.5,  10. ,   0. ],
       [  1. ,   6. ,  12. ,   0. ],
       [  2. ,   7. ,  12. ,   0. ],
       [  3. ,   8. ,  12. ,   0. ],
       [  1.5,   9. ,  14. ,   0. ]])
Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.
Also note how the all-nan column is handled. Its mean is masked, since there are no elements to average, and np.where reads the underlying data of the masked result, which is 0 in the output above. The method using nanmean doesn't handle all-nan columns:
>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[  0. ,   7.5,  10. ,  nan],
       [  1. ,   6. ,  12. ,  nan],
       [  2. ,   7. ,  12. ,  nan],
       [  3. ,   8. ,  12. ,  nan],
       [  1.5,   9. ,  14. ,  nan]])
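If you want to stay with np.nanmean, one possible workaround is to silence the warning and substitute a fallback value for the columns whose mean comes back as nan. This is a sketch only; the function name nanmean_fill and the fallback parameter are invented for illustration, and 0.0 is chosen just to match the masked-array output.

```python
import warnings
import numpy as np

def nanmean_fill(arr, fallback=0.0):
    """Replace NaN with the column mean; all-NaN columns get `fallback`."""
    arr = np.asarray(arr, dtype=float)
    with warnings.catch_warnings():
        # np.nanmean emits "Mean of empty slice" for all-NaN columns.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        col_mean = np.nanmean(arr, axis=0)
    # Columns with no valid data have a NaN mean; use the fallback there.
    col_mean = np.where(np.isnan(col_mean), fallback, col_mean)
    # Broadcasting aligns col_mean with every row of arr.
    return np.where(np.isnan(arr), col_mean, arr)

a = np.array([[0.0, np.nan, 10.0, np.nan],
              [1.0, 6.0, np.nan, np.nan],
              [2.0, 7.0, 12.0, np.nan],
              [3.0, 8.0, np.nan, np.nan],
              [np.nan, 9.0, 14.0, np.nan]])
result = nanmean_fill(a)
```

Whether 0 is a sensible value for a column with no data depends on your use case, so making the fallback explicit is probably safer than relying on a default.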
Explanation
Converting a into a masked array gives you
>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
[[0.0 -- 10.0 --]
[1.0 6.0 -- --]
[2.0 7.0 12.0 --]
[3.0 8.0 -- --]
[-- 9.0 14.0 --]],
mask =
[[False True False True]
[False False True True]
[False False False True]
[False False True True]
[ True False False True]],
fill_value = 1e+20)
And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:
>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
mask = [False False False True],
fill_value = 1e+20)
Further, note how the mask nicely handles the column which is all-nan!
Finally, np.where does the job of replacement.
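Put together as a script, the masked-array steps might look like the sketch below. Two things differ from the one-liner and are my own choices: ma.masked_invalid builds the mask (note that it masks inf as well as nan), and .filled(0.0) makes the value used for the all-nan column explicit instead of relying on whatever sits under the mask.

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0.0, np.nan, 10.0, np.nan],
              [1.0, 6.0, np.nan, np.nan],
              [2.0, 7.0, 12.0, np.nan],
              [3.0, 8.0, np.nan, np.nan],
              [np.nan, 9.0, 14.0, np.nan]])

# Step 1: mask the invalid entries (NaN, and also inf).
masked = ma.masked_invalid(a)

# Step 2: column means over the unmasked values only. The all-NaN column
# comes back masked, so fill it with an explicit value.
col_mean = masked.mean(axis=0).filled(0.0)

# Step 3: take the column mean where a is NaN, the original value elsewhere.
result = np.where(np.isnan(a), col_mean, a)
```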
Row-wise mean
Replacing nan values with the row-wise mean instead of the column-wise mean requires a small change so that broadcasting works:
>>> a
array([[  0.,   1.,   2.,   3.,  nan],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  nan,  12.,  nan,  14.],
       [ nan,  nan,  nan,  nan,  nan]])
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[  0. ,   1. ,   2. ,   3. ,   1.5],
       [  7.5,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  12. ,  12. ,  12. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ]])
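To avoid remembering where the np.newaxis goes, the axis can be made a parameter and the reduced dimension restored with np.expand_dims. The following is a sketch under that assumption; the function name fill_nan_with_mean and its fallback argument are made up for illustration.

```python
import numpy as np
import numpy.ma as ma

def fill_nan_with_mean(arr, axis=0, fallback=0.0):
    """Replace NaN with the mean along `axis` (0: column mean, 1: row mean)."""
    arr = np.asarray(arr, dtype=float)
    # Mean over valid entries; slices with no valid data get `fallback`.
    mean = ma.masked_invalid(arr).mean(axis=axis).filled(fallback)
    # Re-insert the reduced axis so the means broadcast against arr.
    mean = np.expand_dims(mean, axis=axis)
    return np.where(np.isnan(arr), mean, arr)

a = np.array([[0.0, 1.0, 2.0, 3.0, np.nan],
              [np.nan, 6.0, 7.0, 8.0, 9.0],
              [10.0, np.nan, 12.0, np.nan, 14.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan]])
row_filled = fill_nan_with_mean(a, axis=1)
```

With axis=1 the means have shape (4, 1), which is the same shape the [:, np.newaxis] indexing produces.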