As a simple example, consider the NumPy array arr as defined below:

    import numpy as np

    arr = np.array([[5, np.nan, np.nan, 7, 2],
                    [3, np.nan, 1, 8, np.nan],
                    [4, 9, 6, np.nan, np.nan]])
where arr looks like this in console output:

    array([[  5.,  nan,  nan,   7.,   2.],
           [  3.,  nan,   1.,   8.,  nan],
           [  4.,   9.,   6.,  nan,  nan]])
I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:

    array([[ 5.,  5.,  5.,  7.,  2.],
           [ 3.,  3.,  1.,  8.,  8.],
           [ 4.,  9.,  6.,  6.,  6.]])
I've tried using for-loops:
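
    for row_idx in range(arr.shape[0]):
        # Start at column 1: column 0 has no left neighbour, and
        # index -1 would wrap around to the last column.
        for col_idx in range(1, arr.shape[1]):
            if np.isnan(arr[row_idx][col_idx]):
                arr[row_idx][col_idx] = arr[row_idx][col_idx - 1]
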
I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

    import pandas as pd

    df = pd.DataFrame(arr)
    # DataFrame.as_matrix() was removed in pandas 1.0 and
    # fillna(method='ffill') is deprecated; use ffill() and to_numpy().
    arr = df.ffill(axis=1).to_numpy()

Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?
Is there another, more efficient way to 'forward-fill' nan values in NumPy arrays (e.g. by using NumPy vectorized operations)?
I've tried to time all solutions thus far. This was my setup script:

    import numba as nb
    import numpy as np
    import pandas as pd

    def random_array():
        choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
        out = np.random.choice(choices, size=(1000, 10))
        return out

    def loops_fill(arr):
        out = arr.copy()
        for row_idx in range(out.shape[0]):
            for col_idx in range(1, out.shape[1]):
                if np.isnan(out[row_idx, col_idx]):
                    out[row_idx, col_idx] = out[row_idx, col_idx - 1]
        return out

    @nb.jit
    def numba_loops_fill(arr):
        '''Numba decorator solution provided by shx2.'''
        out = arr.copy()
        for row_idx in range(out.shape[0]):
            for col_idx in range(1, out.shape[1]):
                if np.isnan(out[row_idx, col_idx]):
                    out[row_idx, col_idx] = out[row_idx, col_idx - 1]
        return out

    def pandas_fill(arr):
        df = pd.DataFrame(arr)
        # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
        out = df.ffill(axis=1).to_numpy()
        return out

    def numpy_fill(arr):
        '''Solution provided by Divakar.'''
        mask = np.isnan(arr)
        idx = np.where(~mask, np.arange(mask.shape[1]), 0)
        np.maximum.accumulate(idx, axis=1, out=idx)
        out = arr[np.arange(idx.shape[0])[:, None], idx]
        return out

followed by this console input:

    %timeit -n 1000 loops_fill(random_array())
    %timeit -n 1000 numba_loops_fill(random_array())
    %timeit -n 1000 pandas_fill(random_array())
    %timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

    1000 loops, best of 3: 9.64 ms per loop
    1000 loops, best of 3: 377 µs per loop
    1000 loops, best of 3: 455 µs per loop
    1000 loops, best of 3: 351 µs per loop

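The %timeit calls above also time random_array() on every run. As a rough sketch of how the fill functions alone could be compared outside IPython, the snippet below builds the input once and uses the standard-library timeit module. The seed, the repeat counts, and the restriction to two of the functions are illustrative assumptions, not part of the original benchmark, and no timings are claimed here.

```python
import timeit

import numpy as np


def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


def numpy_fill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


# Build the test data once so that array creation is not part of the timing.
rng = np.random.default_rng(0)  # fixed seed chosen only for repeatability
data = rng.choice([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan], size=(1000, 10))

for fn in (loops_fill, numpy_fill):
    number = 10
    best = min(timeit.repeat(lambda: fn(data), number=number, repeat=3)) / number
    print(f"{fn.__name__}: {best * 1e6:.0f} µs per call")
```

Both functions leave leading NaNs in a row untouched, so their outputs can be compared directly as a sanity check before trusting any timing.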
The most common way to replace NaN values in pandas is the .fillna() method. This method requires you to specify a value to replace the NaNs with.
The ffill() function fills missing values in a DataFrame. 'ffill' stands for 'forward fill': it propagates the last valid observation forward. The inplace parameter, if True, fills in place.
A NumPy array can be initialized with NaN values using np.full(), for example by creating an array of shape (2, 3) in which every element is the same value.
If there are several consecutive np.nan values (either at the beginning or in the middle), just repeat this operation several times. For instance, if the array has 5 consecutive np.nan values, the following code will "forward fill" all of them with the number before them:
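The code for this answer is not included here, so the following is only a minimal sketch of the idea, assuming that "this operation" means copying each NaN's left neighbour into it. The function name ffill_by_repeated_shift is hypothetical. Each pass fills the NaNs whose left neighbour is already valid, so a run of 5 consecutive NaNs takes 5 passes.

```python
import numpy as np


def ffill_by_repeated_shift(arr):
    """Forward-fill NaNs along axis 1 by repeatedly copying the left neighbour."""
    out = arr.copy()
    while True:
        fillable = np.isnan(out)
        fillable[:, 0] = False  # the first column has no left neighbour
        # Only fill NaNs whose left neighbour is currently a valid number.
        fillable[:, 1:] &= ~np.isnan(out[:, :-1])
        if not fillable.any():
            break
        rows, cols = np.nonzero(fillable)
        out[rows, cols] = out[rows, cols - 1]
    return out


arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print(ffill_by_repeated_shift(arr))
```

NaNs at the start of a row have nothing to their left, so in this sketch they stay NaN and the loop still terminates.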
The push function from the bottleneck package is a good option for forward filling. It is normally used internally by packages such as xarray, it should be faster than the other alternatives, and the package also ships with a set of benchmarks.
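As a brief sketch of what that might look like on the array from the question: this assumes the third-party bottleneck package is installed, and the import is guarded so the snippet still runs without it.

```python
import numpy as np

try:
    import bottleneck as bn  # third-party; assumed to be installed separately
except ImportError:
    bn = None

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

if bn is not None:
    # push() returns a forward-filled copy; axis=1 fills along each row.
    filled = bn.push(arr, axis=1)
    print(filled)
else:
    print("bottleneck is not installed")
```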
Here's a generalized function for n-dimensional arrays. AFAIK, pandas can only work with two dimensions, despite having a MultiIndex to make up for it. With pandas, the only way to accomplish this would be to flatten a DataFrame, unstack the desired level, restack, and finally reshape to the original shape.
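The function from this answer is not reproduced here. One possible way to generalize, sketched under the assumption that the 2-D index trick from Divakar's numpy_fill is reused, is to move the fill axis to the end, flatten all other axes into rows, fill, and restore the shape. The name ffill_along_axis is hypothetical.

```python
import numpy as np


def ffill_along_axis(arr, axis=-1):
    """Forward-fill NaNs along `axis` of an n-dimensional float array."""
    arr = np.asarray(arr, dtype=float)
    # Put the fill axis last, then treat every other axis as a "row".
    moved = np.moveaxis(arr, axis, -1)
    flat = moved.reshape(-1, moved.shape[-1])
    mask = np.isnan(flat)
    # Column index where valid, 0 where NaN; the running max gives the
    # index of the last valid value seen so far in each row.
    idx = np.where(~mask, np.arange(flat.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    filled = flat[np.arange(flat.shape[0])[:, None], idx]
    # Undo the flattening and move the fill axis back where it was.
    return np.moveaxis(filled.reshape(moved.shape), -1, axis)


cube = np.array([[[1.0, np.nan], [np.nan, 4.0]],
                 [[np.nan, 2.0], [3.0, np.nan]]])
print(ffill_along_axis(cube, axis=0))
```

As with the 2-D version, a NaN with no valid value before it along the chosen axis is left as NaN.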
Here's one approach -

    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]

If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -

    arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

Sample input, output -

    In [179]: arr
    Out[179]:
    array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
           [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
           [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

    In [180]: out
    Out[180]:
    array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
           [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
           [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
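To see why the index trick works, it can help to look at the intermediate idx array. The walk-through below applies the same steps to the smaller 3x5 array from the question; the comments show the values of idx before and after the running maximum. The variable name filled is introduced here only to demonstrate the in-place variant on a copy.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Keep the column index where the value is valid, 0 where it is NaN:
# [[0, 0, 0, 3, 4], [0, 0, 2, 3, 0], [0, 1, 2, 0, 0]]
idx = np.where(~mask, np.arange(mask.shape[1]), 0)

# The running maximum replaces each 0 placeholder with the index of the
# last valid column seen so far:
# [[0, 0, 0, 3, 4], [0, 0, 2, 3, 3], [0, 1, 2, 2, 2]]
np.maximum.accumulate(idx, axis=1, out=idx)

# Gather: for every position, read the value at the remembered column.
out = arr[np.arange(idx.shape[0])[:, None], idx]

# In-place variant, applied to a copy so that arr stays unchanged here.
filled = arr.copy()
filled[mask] = filled[np.nonzero(mask)[0], idx[mask]]

print(out)
```

One consequence of using 0 as the placeholder: if a row starts with NaN, idx stays 0 for those positions and points back at the NaN itself, so leading NaNs are left unfilled.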