
Most efficient way to forward-fill NaN values in numpy array

Example Problem

As a simple example, consider the numpy array arr as defined below:

import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

where arr looks like this in console output:

array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])

I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:

array([[  5.,   5.,   5.,  7.,  2.],
       [  3.,   3.,   1.,  8.,  8.],
       [  4.,   9.,   6.,  6.,  6.]])

Tried thus far

I've tried using for-loops:

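for row_idx in range(arr.shape[0]):
    # Start at column 1: column 0 has no left neighbour, and index -1
    # would wrap around to the last column of the row.
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx][col_idx]):
            arr[row_idx][col_idx] = arr[row_idx][col_idx - 1]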

I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

import pandas as pd

df = pd.DataFrame(arr)
# DataFrame.ffill replaces the deprecated fillna(method='ffill');
# to_numpy replaces as_matrix, which was removed in pandas 1.0.
arr = df.ffill(axis=1).to_numpy()

Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?


Summary

Is there another more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)


Update: Solutions Comparison

I've tried to time all solutions thus far. This was my setup script:

import numba as nb
import numpy as np
import pandas as pd


def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out


def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # Originally fillna(method='ffill', axis=1, inplace=True) followed by
    # as_matrix(); as_matrix() was removed in pandas 1.0.
    out = df.ffill(axis=1).to_numpy()
    return out


def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out

followed by this console input:

%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
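Timings like these only mean something if the implementations return the same values. The following is a minimal sketch of such a check, not part of the original setup script: it compares the loop version against the vectorized version on one random array. The fixed seed and the names rng, sample, expected, actual and results_match are assumptions introduced for illustration. Note also that each timed statement above includes the call to random_array(), so array generation is part of every measurement.

```python
import numpy as np


def loops_fill(arr):
    # Reference implementation: explicit loops, as in the setup script.
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


def numpy_fill(arr):
    # Vectorized implementation, as in the setup script.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


# Fixed seed is an assumption, used only to make the check repeatable.
rng = np.random.default_rng(0)
choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
sample = rng.choice(choices, size=(1000, 10))

expected = loops_fill(sample)
actual = numpy_fill(sample)

# assert_array_equal treats NaNs in matching positions as equal, so rows
# that begin with NaN (nothing to fill from) do not cause a false failure.
np.testing.assert_array_equal(actual, expected)
results_match = True
```

Both versions leave NaNs at the start of a row untouched, since there is no valid value to the left to copy from.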
Xukrao asked Dec 16 '16 19:12



1 Answer

Here's one approach -

mask = np.isnan(arr)
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
np.maximum.accumulate(idx, axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:, None], idx]

If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]] 

Sample input, output -

In [179]: arr
Out[179]:
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In [180]: out
Out[180]:
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
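To see why the approach works, it can help to look at the intermediate index arrays. The sketch below is an illustration, not part of the original answer: it runs the same steps on the 3x5 array from the question. The names idx_raw and filled are introduced here only for readability. Each valid cell records its own column index, each NaN cell records 0, and the running maximum along the row then carries the index of the last valid column forward.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Valid cells keep their own column index; NaN cells get 0 as a placeholder.
idx_raw = np.where(~mask, np.arange(mask.shape[1]), 0)
# idx_raw:
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# Running maximum along each row: every cell now holds the column index
# of the nearest valid value at or to the left of it.
idx = np.maximum.accumulate(idx_raw, axis=1)
# idx:
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Gather: row i, column j of the result is arr[i, idx[i, j]].
out = arr[np.arange(idx.shape[0])[:, None], idx]

# In-place variant from the answer, applied to a copy so arr stays intact.
filled = arr.copy()
filled[mask] = filled[np.nonzero(mask)[0], idx[mask]]
```

Because NaN cells fall back to index 0, a row whose first element is NaN keeps its leading NaNs: there is no valid value to the left to copy.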
Divakar answered Oct 14 '22 19:10