
Most efficient way to forward-fill NaN values in numpy array

Example Problem

As a simple example, consider the numpy array arr as defined below:

    import numpy as np

    arr = np.array([[5, np.nan, np.nan, 7, 2],
                    [3, np.nan, 1, 8, np.nan],
                    [4, 9, 6, np.nan, np.nan]])

where arr looks like this in console output:

    array([[  5.,  nan,  nan,   7.,   2.],
           [  3.,  nan,   1.,   8.,  nan],
           [  4.,   9.,   6.,  nan,  nan]])

I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:

    array([[  5.,   5.,   5.,   7.,   2.],
           [  3.,   3.,   1.,   8.,   8.],
           [  4.,   9.,   6.,   6.,   6.]])

Tried thus far

I've tried using for-loops:

    for row_idx in range(arr.shape[0]):
        # Start at column 1: column 0 has no left neighbour to copy from.
        for col_idx in range(1, arr.shape[1]):
            if np.isnan(arr[row_idx, col_idx]):
                arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]
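A subtlety with loops of this kind: when `col_idx` is 0, `col_idx - 1` is -1, which NumPy treats as the last column. The sketch below uses a made-up one-row input with a leading NaN (not taken from the example above) to show the difference between starting the inner loop at column 0 and at column 1.

```python
import numpy as np

row = np.array([np.nan, 4.0, 7.0])  # illustrative input with a leading NaN

# Starting at column 0: index -1 wraps around to the last element.
wrapped = row.copy()
for col_idx in range(wrapped.shape[0]):
    if np.isnan(wrapped[col_idx]):
        wrapped[col_idx] = wrapped[col_idx - 1]

# Starting at column 1: a leading NaN has nothing to its left and stays NaN.
kept = row.copy()
for col_idx in range(1, kept.shape[0]):
    if np.isnan(kept[col_idx]):
        kept[col_idx] = kept[col_idx - 1]

print(wrapped)  # first element was filled from the END of the row (7.0)
print(kept)     # first element is still NaN
```

The example array in the question has no NaN in the first column, so both variants give the same result there.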

I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

    import pandas as pd

    df = pd.DataFrame(arr)
    # Forward-fill along each row, then convert back to a NumPy array.
    arr = df.ffill(axis=1).to_numpy()

Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?

Summary

Is there another more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)

Update: Solutions Comparison

I've tried to time all solutions thus far. This was my setup script:

    import numba as nb
    import numpy as np
    import pandas as pd

    def random_array():
        choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
        out = np.random.choice(choices, size=(1000, 10))
        return out

    def loops_fill(arr):
        out = arr.copy()
        for row_idx in range(out.shape[0]):
            for col_idx in range(1, out.shape[1]):
                if np.isnan(out[row_idx, col_idx]):
                    out[row_idx, col_idx] = out[row_idx, col_idx - 1]
        return out

    @nb.jit
    def numba_loops_fill(arr):
        '''Numba decorator solution provided by shx2.'''
        out = arr.copy()
        for row_idx in range(out.shape[0]):
            for col_idx in range(1, out.shape[1]):
                if np.isnan(out[row_idx, col_idx]):
                    out[row_idx, col_idx] = out[row_idx, col_idx - 1]
        return out

    def pandas_fill(arr):
        df = pd.DataFrame(arr)
        # Forward-fill along each row, then convert back to a NumPy array.
        out = df.ffill(axis=1).to_numpy()
        return out

    def numpy_fill(arr):
        '''Solution provided by Divakar.'''
        mask = np.isnan(arr)
        idx = np.where(~mask, np.arange(mask.shape[1]), 0)
        np.maximum.accumulate(idx, axis=1, out=idx)
        out = arr[np.arange(idx.shape[0])[:, None], idx]
        return out

followed by this console input:

    %timeit -n 1000 loops_fill(random_array())
    %timeit -n 1000 numba_loops_fill(random_array())
    %timeit -n 1000 pandas_fill(random_array())
    %timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

    1000 loops, best of 3: 9.64 ms per loop
    1000 loops, best of 3: 377 µs per loop
    1000 loops, best of 3: 455 µs per loop
    1000 loops, best of 3: 351 µs per loop
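Timings like these are only comparable if the candidates return the same result. As a rough sanity check, the sketch below re-declares the two NumPy-only functions (so it runs without numba or pandas) and compares them on one random input. The seeded `np.random.default_rng(0)` generator is an assumption made here for reproducibility; it is not part of the original setup script.

```python
import numpy as np

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def numpy_fill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

# Assumption: a fixed seed so the check is repeatable.
rng = np.random.default_rng(0)
choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
sample = rng.choice(choices, size=(1000, 10))

expected = loops_fill(sample)
actual = numpy_fill(sample)

# assert_array_equal treats NaNs in matching positions as equal, which
# matters because rows that start with NaN keep those leading NaNs.
np.testing.assert_array_equal(actual, expected)
print("loops_fill and numpy_fill agree")
```

Note that the timing commands above also include the cost of `random_array()` in every measurement, so the differences between the fill functions themselves may be somewhat larger than the raw numbers suggest.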


asked Dec 16 '16 19:12

Xukrao

1 Answer

Here's one approach -

    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
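To make the steps easier to follow, here is a walk-through of the same four lines on the 3x5 array from the question, with the intermediate index arrays written out. The variable names `col_of_valid` and `last_valid` are only illustrative; the snippet above reuses a single `idx` array instead.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

# Step 1: True wherever a value is missing.
mask = np.isnan(arr)

# Step 2: each valid entry records its own column index; NaN entries get 0.
col_of_valid = np.where(~mask, np.arange(mask.shape[1]), 0)
# [[0, 0, 0, 3, 4],
#  [0, 0, 2, 3, 0],
#  [0, 1, 2, 0, 0]]

# Step 3: a running maximum along each row turns that into
# "column of the most recent valid entry at or before this position".
last_valid = np.maximum.accumulate(col_of_valid, axis=1)
# [[0, 0, 0, 3, 4],
#  [0, 0, 2, 3, 3],
#  [0, 1, 2, 2, 2]]

# Step 4: gather. The (n_rows, 1) row index broadcasts against last_valid.
rows = np.arange(arr.shape[0])[:, None]
out = arr[rows, last_valid]
print(out)
```

Because NaN positions are given index 0, a row that starts with NaN points back at its own first element, so leading NaNs stay NaN.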

If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
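As a sketch of how the in-place variant behaves, the snippet below wraps it in a helper and runs it on a small input. The function name `ffill_rows_inplace` and the two-row test array are assumptions made for illustration. `np.nonzero(mask)[0]` gives the row of every NaN entry, and `idx[mask]` gives the column to copy from, both in the same row-major order.

```python
import numpy as np

def ffill_rows_inplace(arr):
    """Forward-fill NaNs along each row of a 2-D float array, modifying arr."""
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    # The right-hand side is gathered into a temporary array before the
    # assignment happens, so reading and writing arr here is safe.
    arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

# Illustrative input; the second row starts with two NaNs.
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [np.nan, np.nan, 1, 8, np.nan]])
alias = arr  # same object, to show that no new array is created

ffill_rows_inplace(arr)
print(alias)  # the NaNs after a valid value are filled; leading NaNs remain
```

Only the masked positions are written, which avoids allocating a second full-size output array.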

Sample input, output -

    In [179]: arr
    Out[179]:
    array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
           [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
           [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

    In [180]: out
    Out[180]:
    array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
           [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
           [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])

answered Oct 14 '22 19:10

Divakar

