Finding start and stops of consecutive values block in Python/Numpy/Pandas

I want to find the start and stop indexes of blocks of identical values in a numpy array, or preferably in a pandas DataFrame (blocks running along each row for a 2D array, i.e. along the most quickly varying index for an n-dimensional array). I only look for blocks along a single dimension and don't want to aggregate NaNs across different rows.

Starting from that question (Find large number of consecutive values fulfilling condition in a numpy array), I wrote the following solution that finds blocks of np.nan in a 2D array:

import numpy as np
a = np.array([
        [1, np.nan, np.nan, 2],
        [np.nan, 1, np.nan, 3],
        [np.nan, np.nan, np.nan, np.nan]
    ])

nan_mask = np.isnan(a)

# A block starts at a NaN in the first column, or wherever a
# non-NaN is immediately followed by a NaN.
start_nans_mask = np.hstack((np.resize(nan_mask[:,0],(a.shape[0],1)),
                             np.logical_and(np.logical_not(nan_mask[:,:-1]), nan_mask[:,1:])
                             ))
# A block stops wherever a NaN is immediately followed by a
# non-NaN, or at a NaN in the last column.
stop_nans_mask = np.hstack((np.logical_and(nan_mask[:,:-1], np.logical_not(nan_mask[:,1:])),
                            np.resize(nan_mask[:,-1], (a.shape[0],1))
                            ))

start_row_idx,start_col_idx = np.where(start_nans_mask)
stop_row_idx,stop_col_idx = np.where(stop_nans_mask)

This lets me, for example, analyze the distribution of the lengths of the patches of missing values before applying pd.fillna:

stop_col_idx - start_col_idx + 1
array([2, 1, 1, 4], dtype=int64)
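For instance, a quick way to summarize that distribution (a small sketch, reusing the index arrays computed above):

from collections import Counter

lengths = stop_col_idx - start_col_idx + 1
print(Counter(lengths.tolist()))  # Counter({1: 2, 2: 1, 4: 1})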

One more example and the expected result:

a = np.array([
        [1, np.nan, np.nan, 2],
        [np.nan, 1, np.nan, np.nan], 
        [np.nan, np.nan, np.nan, np.nan]
    ])

array([2, 1, 2, 4], dtype=int64)

and not

array([2, 1, 6], dtype=int64)

My questions are the following:

  • Is there a way to optimize my solution (finding starts and ends in a single pass of mask/where operations)? See the sketch after this list for one direction I have in mind.
  • Is there a more optimized solution in pandas? (i.e. a different approach than just applying mask/where to the DataFrame's values)
  • What happens when the underlying array or DataFrame is too big to fit in memory?
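On the first question, here is the single-pass variant I have in mind (a sketch only, not benchmarked): pad the NaN mask with a column of zeros on each side, then one np.diff marks +1 at every block start and -1 just past every block end.

import numpy as np

nan_mask = np.isnan(a)
# Zero-pad each row on both sides so blocks touching an edge
# are detected too.
padded = np.zeros((a.shape[0], a.shape[1] + 2), dtype=np.int8)
padded[:, 1:-1] = nan_mask
d = np.diff(padded, axis=1)  # +1 at block starts, -1 one past block ends

start_row_idx, start_col_idx = np.where(d == 1)
stop_row_idx, stop_col_idx = np.where(d == -1)
stop_col_idx -= 1  # make the stop index inclusive

stop_col_idx - start_col_idx + 1  # array([2, 1, 1, 4]) for the first example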
Asked by Guillaume on Dec 26 '22

1 Answer

I loaded your np.array into a DataFrame:

In [26]: df
Out[26]:
    0   1   2   3
0   1 NaN NaN   2
1 NaN   1 NaN   3
2 NaN NaN NaN NaN

Then I transposed it and turned it into a Series; this flattens the frame in row-major order, similar to np.hstack on the rows:

In [28]: s = df.T.unstack(); s
Out[28]:
0  0     1
   1   NaN
   2   NaN
   3     2
1  0   NaN
   1     1
   2   NaN
   3     3
2  0   NaN
   1   NaN
   2   NaN
   3   NaN
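As a quick sanity check that the unstacked Series really visits the values in row-major order (NaNs in matching positions compare equal under np.testing.assert_array_equal):

import numpy as np

np.testing.assert_array_equal(s.values, df.values.ravel())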

This expression creates a Series of block labels: the counter increments by 1 at every non-null value, so each run of NaNs inherits the block number of the non-null value that precedes it:

In [29]: s.notnull().astype(int).cumsum()
Out[29]:
0  0    1
   1    1
   2    1
   3    2
1  0    2
   1    3
   2    3
   3    4
2  0    4
   1    4
   2    4
   3    4

This expression creates a Series where every NaN is a 1 and everything else is a 0:

In [31]: s.isnull().astype(int)
Out[31]:
0  0    0
   1    1
   2    1
   3    0
1  0    1
   1    0
   2    1
   3    0
2  0    1
   1    1
   2    1
   3    1

We can combine the two: grouping the NaN indicator by the block labels and summing gives the counts you need:

In [32]: s.isnull().astype(int).groupby(s.notnull().astype(int).cumsum()).sum()
Out[32]:
1    2
2    1
3    1
4    4
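Putting it together as one self-contained sketch (note the final filter: on inputs where some blocks contain no NaN at all, zero-sum groups would appear, so I drop them):

import numpy as np
import pandas as pd

a = np.array([
    [1, np.nan, np.nan, 2],
    [np.nan, 1, np.nan, 3],
    [np.nan, np.nan, np.nan, np.nan]
])

s = pd.DataFrame(a).T.unstack()              # row-major Series
block_id = s.notnull().astype(int).cumsum()  # block labels
run_lengths = s.isnull().astype(int).groupby(block_id).sum()
run_lengths = run_lengths[run_lengths > 0]   # keep only blocks containing NaNs
print(run_lengths.values)                    # [2 1 1 4]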
Answered by Zelazny7 on Dec 28 '22