
Counting non-overlapping runs of non-zero values by row in a DataFrame

Let's say I have the following Pandas DataFrame:

id | a1 | a2 | a3 | a4 
1  | 3  | 0  | 10 | 25   
2  | 0  | 0  | 31 | 15  
3  | 20 | 11 | 6  | 5  
4  | 0  | 3  | 1  | 7  

What I want is to calculate the number of non-overlapping runs of n consecutive non-zero values in each row, for various values of n. The desired output would be:

id | a1 | a2 | a3 | a4 | 2s | 3s | 4s
1  | 3  | 0  | 10 | 25 | 1  | 0  | 0
2  | 0  | 0  | 31 | 15 | 1  | 0  | 0
3  | 20 | 11 | 6  | 5  | 2  | 1  | 1
4  | 0  | 3  | 1  | 7  | 1  | 1  | 0

where e.g. each value in the 2s column shows the number of non-overlapping runs of length 2 in that row, each value in the 3s column shows the corresponding number of runs of length 3, and so on.
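To pin down what the table means, here is a minimal pure-Python sketch of one reading that is consistent with the desired output: a maximal stretch of L consecutive non-zero values contributes L // n runs of length n. The helper name `count_runs` and the `rows` dictionary are illustrative only, not part of any library.

```python
from itertools import groupby

def count_runs(row, n):
    """Count non-overlapping runs of n consecutive non-zero values in row."""
    total = 0
    # groupby splits the row into maximal stretches of zero / non-zero values
    for is_nonzero, group in groupby(row, key=lambda x: x != 0):
        if is_nonzero:
            # a stretch of length L holds L // n non-overlapping runs of length n
            total += len(list(group)) // n
    return total

# the rows of the example DataFrame, keyed by id
rows = {1: [3, 0, 10, 25], 2: [0, 0, 31, 15], 3: [20, 11, 6, 5], 4: [0, 3, 1, 7]}
counts = {i: [count_runs(r, n) for n in (2, 3, 4)] for i, r in rows.items()}
print(counts)
# {1: [1, 0, 0], 2: [1, 0, 0], 3: [2, 1, 1], 4: [1, 1, 0]}
```

This loop is only a reference for checking results; it is not vectorized.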

I wonder if there are any Pandas or Numpy methods to take care of this?

renakre asked Jan 29 '17 09:01


2 Answers

Here's one approach using 2D convolution that handles any run length -

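import numpy as np
import pandas as pd
from scipy.signal import convolve2d as conv2

n = 6  # largest window size to count
mask = (df.values != 0).astype(int)  # 1 where the value is non-zero
# a window sum equal to I means I consecutive non-zeros
v = np.vstack([(conv2(mask, [[1]*I]) == I).sum(1) for I in range(2, n+1)]).T
df_v = pd.DataFrame(v, columns=[str(i)+'s' for i in range(2, n+1)], index=df.index)
df_out = pd.concat([df, df_v], axis=1)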

Basic idea

The basic idea is to slide a window along each row and sum the non-zero flags inside it. Say we want to count how often three non-zeros occur consecutively. We use a sliding window of size 3 and compute the sliding sums. Every position where all three elements in the window are non-zero produces a sum of 3, so we look for sums equal to 3 and count them. That's it! We loop through all window sizes to get the 2s, 3s, etc.
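The sliding-sum idea can be sketched for a single row with plain NumPy. This is an illustration under the assumption that a 1-D `np.convolve` stands in for the 2D convolution; `count_windows` is a hypothetical helper name.

```python
import numpy as np

def count_windows(row, size):
    """Count windows of `size` consecutive non-zero values (overlaps included)."""
    mask = (np.asarray(row) != 0).astype(int)  # 1 where the value is non-zero
    # 'valid' keeps only windows that lie fully inside the row
    sums = np.convolve(mask, np.ones(size, dtype=int), mode='valid')
    # a window sum equal to `size` means every element in it is non-zero
    return int((sums == size).sum())

row = [20, 11, 6, 5]
print([count_windows(row, s) for s in (2, 3, 4)])
# [3, 2, 1]
```

Note that the windows overlap: four consecutive non-zeros give three windows of size 2, whereas the question's table expects 2 for that row. The sliding-window count therefore matches the non-overlapping count only when no run is long enough to hold overlapping windows.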

Here's a sample run to count for 3s on an array -

In [326]: a
Out[326]: 
array([[0, 2, 1, 2, 1, 2],
       [2, 2, 2, 0, 0, 0],
       [2, 2, 1, 1, 1, 1],
       [1, 2, 1, 2, 0, 1]])

In [327]: a!=0
Out[327]: 
array([[False,  True,  True,  True,  True,  True],
       [ True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True, False,  True]], dtype=bool)

In [329]: conv2(a!=0,[[1]*3])
Out[329]: 
array([[0, 1, 2, 3, 3, 3, 2, 1],
       [1, 2, 3, 2, 1, 0, 0, 0],
       [1, 2, 3, 3, 3, 3, 2, 1],
       [1, 2, 3, 3, 2, 2, 1, 1]])

In [330]: conv2(a!=0,[[1]*3])==3
Out[330]: 
array([[False, False, False,  True,  True,  True, False, False],
       [False, False,  True, False, False, False, False, False],
       [False, False,  True,  True,  True,  True, False, False],
       [False, False,  True,  True, False, False, False, False]], dtype=bool)

In [331]: (conv2(a!=0,[[1]*3])==3).sum(1)
Out[331]: array([3, 1, 4, 2])

Sample run -

In [158]: df_out
Out[158]: 
   a1  a2  a3  a4  a5  a6  2s  3s  4s  5s  6s
0   1   2   1   0   0   2   2   1   0   0   0
1   1   1   2   1   0   1   3   2   1   0   0
2   1   1   0   0   1   1   2   0   0   0   0
3   2   2   1   0   2   2   3   1   0   0   0

Please note that if the first column is 'id', it must be skipped: use df.values[:,1:] instead of df.values in the solution code above.
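As an alternative to positional slicing, the 'id' column can be excluded by name. The snippet below is a small sketch that rebuilds the question's DataFrame by hand (the construction itself is an assumption about how the data is loaded) and shows two equivalent ways to get the value columns only.

```python
import pandas as pd

# the example data from the question, with 'id' as an ordinary column
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'a1': [3, 0, 20, 0],
    'a2': [0, 0, 11, 3],
    'a3': [10, 31, 6, 1],
    'a4': [25, 15, 5, 7],
})

values = df.drop(columns='id').to_numpy()  # option 1: drop the column by name
indexed = df.set_index('id')               # option 2: move it into the index
print(values.shape)  # (4, 4)
```

With the second option, `indexed.values` already contains only the a1..a4 columns, so the solution code can be applied to `indexed` unchanged.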

Divakar answered Oct 21 '22 07:10


A solution that handles the non-overlapping requirement.

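import numpy as np

def count(row, mins):
    # encode the row as bytes (1 = non-zero, 0 = zero) and split on the zero byte
    runs = (row != 0).astype(np.uint8).tobytes().decode().split(chr(0))
    lengths = [len(run) for run in runs]
    # a run of length L holds L // m non-overlapping runs of length m
    return np.floor_divide.outer(lengths, mins).sum(0)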

It uses fast string operations to find all the runs, then uses // to compute how many non-overlapping runs of a given length fit in each one.

With df (id set as the index):

    a1  a2  a3  a4
id                
1    3   0  10  25
2    0   0  31  15
3   20  11   6   5
4    0   3   1   7

np.apply_along_axis(count,1,df,[2,3,4]) returns

array([[1, 0, 0],
       [1, 0, 0],
       [2, 1, 1],
       [1, 1, 0]], dtype=int32)

which is the expected result for df.
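The same two steps (find the run lengths, then floor-divide) can also be sketched without the string round trip, using np.diff on a padded mask to locate where runs start and end. The names `run_lengths` and `count_nonoverlapping` are hypothetical helpers for illustration.

```python
import numpy as np

def run_lengths(row):
    """Lengths of the maximal stretches of non-zero values in a 1-D row."""
    mask = np.asarray(row) != 0
    # pad with False on both sides so every stretch has a start and an end
    padded = np.concatenate(([False], mask, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(edges == -1)    # 1 -> 0 transitions
    return ends - starts

def count_nonoverlapping(row, mins):
    # a stretch of length L holds L // m non-overlapping runs of length m
    return np.floor_divide.outer(run_lengths(row), mins).sum(axis=0)

data = np.array([[3, 0, 10, 25],
                 [0, 0, 31, 15],
                 [20, 11, 6, 5],
                 [0, 3, 1, 7]])
result = np.array([count_nonoverlapping(r, [2, 3, 4]) for r in data])
print(result.tolist())
# [[1, 0, 0], [1, 0, 0], [2, 1, 1], [1, 1, 0]]
```

A row with no non-zero values yields an empty length array, and the sum over it gives zeros for every requested length.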

B. M. answered Oct 21 '22 07:10