
Numpy 2D array: change all values to the right of NaNs

Situation

I have a 2D NumPy array that contains some NaN values. Simplified example:

import numpy as np

arr = np.array([[3, 5, np.nan, 2, 4],
                [9, 1, 3, 5, 1],
                [8, np.nan, 3, np.nan, 7]])

which looks like this in console output:

array([[  3.,   5.,  nan,   2.,   4.],
       [  9.,   1.,   3.,   5.,   1.],
       [  8.,  nan,   3.,  nan,   7.]])

Problem

I am looking for a good way to set all values to the right of an existing NaN in the same row to NaN as well. In other words, I need to convert the example array to this:

array([[  3.,   5.,  nan,  nan,  nan],
       [  9.,   1.,   3.,   5.,   1.],
       [  8.,  nan,  nan,  nan,  nan]]) 

I know how to accomplish this with loops, but I would expect a method that uses only vectorized NumPy operations to be much more efficient. Could anyone help me find such a method?

Xukrao asked Jan 05 '23 14:01


1 Answer

One approach with cumsum and boolean-indexing -

arr[np.isnan(arr).cumsum(1)>0] = np.nan
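
To see why this works, it may help to look at the intermediate arrays for the sample input. The following is only a minimal sketch; the names nan_mask, running_count and fill_mask are illustrative and not part of the one-liner above.

```python
import numpy as np

arr = np.array([[3, 5, np.nan, 2, 4],
                [9, 1, 3, 5, 1],
                [8, np.nan, 3, np.nan, 7]])

# True wherever the input holds a NaN
nan_mask = np.isnan(arr)

# Running count of NaNs seen so far in each row, left to right
running_count = nan_mask.cumsum(axis=1)
# running_count is
# [[0 0 1 1 1]
#  [0 0 0 0 0]
#  [0 1 1 2 2]]

# Every position at or after the first NaN of its row has a count > 0
fill_mask = running_count > 0

# Boolean indexing assigns NaN to all selected positions in place
arr[fill_mask] = np.nan
```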

For performance, it might be better to use np.maximum.accumulate -

arr[np.maximum.accumulate(np.isnan(arr),axis=1)] = np.nan
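
On a boolean array, a running maximum acts like a running logical OR: once a True has been seen in a row, every later position stays True. The sketch below, with or_mask and cumsum_mask as illustrative names, checks on the sample input that this gives the same mask as the cumsum version while staying boolean throughout.

```python
import numpy as np

arr = np.array([[3, 5, np.nan, 2, 4],
                [9, 1, 3, 5, 1],
                [8, np.nan, 3, np.nan, 7]])

nan_mask = np.isnan(arr)

# Running maximum along each row; the result keeps the boolean dtype
or_mask = np.maximum.accumulate(nan_mask, axis=1)

# Mask from the cumsum approach, built via an integer count
cumsum_mask = nan_mask.cumsum(axis=1) > 0

# Both masks select the same positions
same = np.array_equal(or_mask, cumsum_mask)
print(same)  # → True
```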

One more way, with a slightly twisted use of broadcasting -

n = arr.shape[1]
mask = np.isnan(arr)
idx = mask.argmax(1)
idx[~mask.any(1)] = n
arr[idx[:,None] <= np.arange(n)] = np.nan
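
The broadcasting version is the least obvious of the three, so here is the same code again on the sample input with the intermediate values spelled out. This is a sketch for illustration; fill_mask is a name introduced here.

```python
import numpy as np

arr = np.array([[3, 5, np.nan, 2, 4],
                [9, 1, 3, 5, 1],
                [8, np.nan, 3, np.nan, 7]])

n = arr.shape[1]
mask = np.isnan(arr)

# argmax on booleans gives the index of the first True in each row,
# and 0 for a row that contains no True at all
idx = mask.argmax(axis=1)            # [2, 0, 1]

# Rows without any NaN get the out-of-range index n, so nothing is selected
idx[~mask.any(axis=1)] = n           # [2, 5, 1]

# Column vector of start indices compared against a row vector of column
# numbers: broadcasting produces a (rows, n) boolean mask
fill_mask = idx[:, None] <= np.arange(n)

arr[fill_mask] = np.nan
```

The second step matters because argmax cannot tell "first NaN is in column 0" apart from "no NaN in this row"; both give 0.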

Sample run -

In [96]: arr
Out[96]: 
array([[  3.,   5.,  nan,   2.,   4.],
       [  9.,   1.,   3.,   5.,   1.],
       [  8.,  nan,   3.,  nan,   7.]])

In [97]: arr[np.maximum.accumulate(np.isnan(arr),axis=1)] = np.nan

In [98]: arr
Out[98]: 
array([[  3.,   5.,  nan,  nan,  nan],
       [  9.,   1.,   3.,   5.,   1.],
       [  8.,  nan,  nan,  nan,  nan]])

Benchmarking

Approaches -

def func1(arr):
    arr[np.isnan(arr).cumsum(1)>0] = np.nan

def func2(arr):
    arr[np.maximum.accumulate(np.isnan(arr),axis=1)] = np.nan

def func3(arr):  # @MSeifert's suggestion
    mask = np.isnan(arr)
    accmask = np.cumsum(mask, out=mask, axis=1)
    arr[accmask] = np.nan

def func4(arr):
    mask = np.isnan(arr)
    np.maximum.accumulate(mask, axis=1, out=mask)
    arr[mask] = np.nan

def func5(arr):
    n = arr.shape[1]
    mask = np.isnan(arr)
    idx = mask.argmax(1)
    idx[~mask.any(1)] = n
    arr[idx[:,None] <= np.arange(n)] = np.nan
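
As a sanity check, the vectorized variants can be compared against a plain-loop reference on a small random input. This is a sketch only: fill_right_loop is a hypothetical helper written for this check (the question mentions a loop solution but does not show it), and the three compact functions restate the approaches so that the block runs on its own. Unlike the functions above, they return a new array instead of modifying their input.

```python
import numpy as np

def fill_right_loop(arr):
    """Hypothetical loop reference: NaN-fill from the first NaN in each row."""
    out = arr.copy()
    for row in out:  # each row is a view, so writes land in `out`
        nan_positions = np.flatnonzero(np.isnan(row))
        if nan_positions.size:
            row[nan_positions[0]:] = np.nan
    return out

def fill_cumsum(arr):
    out = arr.copy()
    out[np.isnan(out).cumsum(axis=1) > 0] = np.nan
    return out

def fill_accumulate(arr):
    out = arr.copy()
    out[np.maximum.accumulate(np.isnan(out), axis=1)] = np.nan
    return out

def fill_broadcast(arr):
    out = arr.copy()
    n = out.shape[1]
    mask = np.isnan(out)
    idx = mask.argmax(axis=1)
    idx[~mask.any(axis=1)] = n
    out[idx[:, None] <= np.arange(n)] = np.nan
    return out

# Small random test input with roughly 5% NaNs (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
sample = rng.random((50, 20))
sample[rng.random(sample.shape) < 0.05] = np.nan

expected = fill_right_loop(sample)
for func in (fill_cumsum, fill_accumulate, fill_broadcast):
    # assert_array_equal treats NaNs in matching positions as equal
    np.testing.assert_array_equal(func(sample), expected)
```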

Timings -

In [201]: # Setup inputs
     ...: arr = np.random.rand(5000,5000)
     ...: arr.ravel()[np.random.choice(arr.size, 10000, replace=False)] = np.nan
     ...: arr1 = arr.copy()
     ...: arr2 = arr.copy()
     ...: arr3 = arr.copy()
     ...: arr4 = arr.copy()
     ...: arr5 = arr.copy()
     ...: 

In [202]: %timeit func1(arr1)
     ...: %timeit func2(arr2)
     ...: %timeit func3(arr3)
     ...: %timeit func4(arr4)
     ...: %timeit func5(arr5)
     ...: 
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 90.5 ms per loop
10 loops, best of 3: 88.8 ms per loop
10 loops, best of 3: 88.5 ms per loop
10 loops, best of 3: 75.3 ms per loop

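The broadcasting-based approach (func5) is the fastest in this run at 75.3 ms, ahead of func2 to func4 at roughly 88-90 ms and the plain cumsum version (func1) at 149 ms.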

Divakar answered Jan 13 '23 22:01