I've got a numpy array filled mostly with real numbers, but there are a few nan values in it as well. How can I replace the nans with the averages of the columns they are in?


numpy array: replace nan values with average of columns

2 Answers

No loops required:

print(a)
[[ 0.93230948         nan  0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785         nan]
 [ 0.64940216  0.74414127         nan         nan]]

# Obtain mean of columns as you need, nanmean is convenient.
col_mean = np.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219  0.7030395   0.44528687  0.66640474]

# Find indices that you need to replace
inds = np.where(np.isnan(a))

# Place column means in the indices. Align the arrays using take
a[inds] = np.take(col_mean, inds[1])

print(a)
[[ 0.93230948  0.7030395   0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785  0.66640474]
 [ 0.64940216  0.74414127  0.44528687  0.66640474]]
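For reuse, the same steps can be wrapped in a small function. This is only a sketch: the name fill_nan_with_column_mean and the choice to return a copy instead of modifying the input in place are assumptions, not part of the answer above.

```python
import numpy as np

def fill_nan_with_column_mean(a):
    """Return a copy of `a` with each nan replaced by its column's nanmean.

    Hypothetical helper name; the steps mirror the snippet above.
    """
    out = np.array(a, dtype=float)          # copy, so the caller's array is untouched
    col_mean = np.nanmean(out, axis=0)      # per-column mean, ignoring nan
    rows, cols = np.where(np.isnan(out))    # positions of the nan entries
    out[rows, cols] = np.take(col_mean, cols)  # pick the mean of each nan's column
    return out

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 5.0]])
filled = fill_nan_with_column_mean(a)
print(filled)
```

Note that this sketch still inherits the limitation of nanmean: a column that is entirely nan stays nan.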

answered Sep 22 '22 20:09

Daniel

Using masked arrays

The standard way to do this using only numpy would be to use the masked array module.

Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.

Edit: np.nanmean is now part of numpy. However, it returns nan (with a RuntimeWarning) for all-nan columns, as shown further down.

Suppose you have an array a:

>>> a
array([[  0.,  nan,  10.,  nan],
       [  1.,   6.,  nan,  nan],
       [  2.,   7.,  12.,  nan],
       [  3.,   8.,  nan,  nan],
       [ nan,   9.,  14.,  nan]])

>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)    
array([[  0. ,   7.5,  10. ,   0. ],
       [  1. ,   6. ,  12. ,   0. ],
       [  2. ,   7. ,  12. ,   0. ],
       [  3. ,   8. ,  12. ,   0. ],
       [  1.5,   9. ,  14. ,   0. ]])

Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.
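The one-liner can also be written as a function that makes the handling of empty columns explicit. A minimal sketch, assuming you want a configurable fallback: the name fill_nan_with_masked_mean and the empty_fill parameter are hypothetical, and ma.filled is used so the result does not depend on whatever data sits underneath a masked mean.

```python
import numpy as np
import numpy.ma as ma

def fill_nan_with_masked_mean(a, empty_fill=0.0):
    """Replace nan with the column mean computed via a masked array.

    `empty_fill` (an assumed parameter) is used for columns with no valid data.
    """
    a = np.asarray(a, dtype=float)
    nan_mask = np.isnan(a)
    col_mean = ma.array(a, mask=nan_mask).mean(axis=0)
    # Convert the (possibly masked) means to a plain ndarray,
    # substituting empty_fill where a column had no unmasked values.
    col_mean = ma.filled(col_mean, empty_fill)
    # col_mean has shape (ncols,) and broadcasts over the rows of a.
    return np.where(nan_mask, col_mean, a)

nan = np.nan
a = np.array([[0.0, nan, 10.0, nan],
              [1.0, 6.0, nan, nan],
              [2.0, 7.0, 12.0, nan],
              [3.0, 8.0, nan, nan],
              [nan, 9.0, 14.0, nan]])
result = fill_nan_with_masked_mean(a)
print(result)
```

Passing a different empty_fill, for example the overall mean of the array, changes only what goes into columns that have no valid entries.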

Also note how the all-nan column of a is handled: its mean is masked, since there are no elements to average, and the replacement value comes out as 0. The method using nanmean leaves nan in all-nan columns:

>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
  warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[  0. ,   7.5,  10. ,   nan],
       [  1. ,   6. ,  12. ,   nan],
       [  2. ,   7. ,  12. ,   nan],
       [  3. ,   8. ,  12. ,   nan],
       [  1.5,   9. ,  14. ,   nan]])
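If you would rather stay with nanmean, one possible workaround is to silence the warning and substitute a fallback for the columns whose mean is nan. This is a sketch under assumptions: the function name and the empty_fill parameter are made up for illustration.

```python
import warnings
import numpy as np

def fill_nan_with_nanmean(a, empty_fill=0.0):
    """nanmean-based fill that also copes with all-nan columns.

    `empty_fill` (an assumed parameter) replaces the mean of empty columns.
    """
    a = np.asarray(a, dtype=float)
    with warnings.catch_warnings():
        # nanmean emits "Mean of empty slice" for all-nan columns
        warnings.simplefilter("ignore", category=RuntimeWarning)
        col_mean = np.nanmean(a, axis=0)
    # All-nan columns produce a nan mean; swap in the fallback value
    col_mean = np.where(np.isnan(col_mean), empty_fill, col_mean)
    return np.where(np.isnan(a), col_mean, a)

a = np.array([[np.nan, np.nan],
              [2.0, np.nan],
              [4.0, np.nan]])
result = fill_nan_with_nanmean(a)
print(result)
```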

Explanation

Converting a into a masked array gives you

>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
 [[0.0 --  10.0 --]
  [1.0 6.0 --   --]
  [2.0 7.0 12.0 --]
  [3.0 8.0 --   --]
  [--  9.0 14.0 --]],
             mask =
 [[False  True False  True]
 [False False  True  True]
 [False False False  True]
 [False False  True  True]
 [ True False False  True]],
       fill_value = 1e+20)

And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:

>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

Further, note how the mask nicely handles the column which is all-nan!

Finally, np.where does the job of replacement.
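To see what "normalizing only over the non-masked values" means, the mean can be rebuilt by hand from the masked sum and the count of unmasked entries. This is an illustrative sketch on a smaller array than the one above; the variable names are arbitrary.

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0.0, np.nan, 10.0],
              [1.0, 6.0, np.nan],
              [2.0, 7.0, 12.0]])
masked = ma.array(a, mask=np.isnan(a))

col_sum = masked.sum(axis=0)      # masked entries are skipped in the sum
col_count = masked.count(axis=0)  # number of unmasked entries per column
manual_mean = col_sum / col_count # divide by the valid count, not the row count

print(col_count)
print(manual_mean)
print(masked.mean(axis=0))
```

The second column is divided by 2 rather than 3, which is why the masked mean differs from a plain mean over a nan-free placeholder.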

Row-wise mean

Replacing nan values with the row-wise mean instead of the column-wise mean requires a small change so that broadcasting works:

>>> a
array([[  0.,   1.,   2.,   3.,  nan],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  nan,  12.,  nan,  14.],
       [ nan,  nan,  nan,  nan,  nan]])

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[  0. ,   1. ,   2. ,   3. ,   1.5],
       [  7.5,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  12. ,  12. ,  12. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ]])
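Both cases can be covered by one function that re-inserts the reduced axis before broadcasting. A minimal sketch, with the function name and the axis and empty_fill parameters as assumptions; np.expand_dims plays the role of [:, np.newaxis] for whichever axis was averaged.

```python
import numpy as np
import numpy.ma as ma

def fill_nan_with_mean(a, axis=0, empty_fill=0.0):
    """Replace nan with the mean taken along `axis` (0: columns, 1: rows).

    Hypothetical helper; `empty_fill` is used where a whole slice is nan.
    """
    a = np.asarray(a, dtype=float)
    nan_mask = np.isnan(a)
    means = ma.array(a, mask=nan_mask).mean(axis=axis)
    means = ma.filled(means, empty_fill)  # plain ndarray, no masked entries
    # Put the averaged axis back with length 1 so means broadcasts against a
    means = np.expand_dims(means, axis)
    return np.where(nan_mask, means, a)

nan = np.nan
a = np.array([[0.0, 1.0, 2.0, 3.0, nan],
              [nan, 6.0, 7.0, 8.0, 9.0],
              [10.0, nan, 12.0, nan, 14.0],
              [nan, nan, nan, nan, nan]])
row_filled = fill_nan_with_mean(a, axis=1)
print(row_filled)
```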

answered Sep 18 '22 20:09

Praveen


Tags: python, arrays, nan, numpy

piokuc
