Let's say I have the following Pandas DataFrame:
id | a1 | a2 | a3 | a4
1 | 3 | 0 | 10 | 25
2 | 0 | 0 | 31 | 15
3 | 20 | 11 | 6 | 5
4 | 0 | 3 | 1 | 7
What I want is to calculate the number of non-overlapping runs of n consecutive non-zero values in each row, for various values of n. The desired output would be:
id | a1 | a2 | a3 | a4 | 2s | 3s | 4s
1 | 3 | 0 | 10 | 25 | 1 | 0 | 0
2 | 0 | 0 | 31 | 15 | 1 | 0 | 0
3 | 20 | 11 | 6 | 5 | 2 | 1 | 1
4 | 0 | 3 | 1 | 7 | 1 | 1 | 0
where, e.g., each value in the 2s column shows the number of non-overlapping runs of length 2 in that row, each value in the 3s column shows the corresponding number of runs of length 3, and so on.
Are there any Pandas or NumPy methods that take care of this?
Here's one approach using 2D convolution that works for any number of elements in a row:
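import numpy as np
import pandas as pd
from scipy.signal import convolve2d as conv2

n = 6
# For each window size I, count the window positions whose I entries are all non-zero
mask = (df.values != 0).astype(int)
v = np.vstack([(conv2(mask, [[1]*I]) == I).sum(1) for I in range(2, n+1)]).T
df_v = pd.DataFrame(v, columns=[str(i)+'s' for i in range(2, n+1)], index=df.index)
df_out = pd.concat([df, df_v], axis=1)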
Basic idea
The basic idea is to slide a window along each row and sum the non-zero indicators inside it. Say we want to know how many times three non-zeros occur consecutively. We use a sliding window of size 3 and compute the sliding sums. Every position where all three elements in the window are non-zero produces a sum of 3, so we look for sums equal to 3 and count them. That's it! We loop over all window sizes to get the 2s, 3s, etc. Note that this counts every window position, so overlapping windows are counted separately.
Here's a sample run that counts the 3s on an array:
In [326]: a
Out[326]:
array([[0, 2, 1, 2, 1, 2],
       [2, 2, 2, 0, 0, 0],
       [2, 2, 1, 1, 1, 1],
       [1, 2, 1, 2, 0, 1]])

In [327]: a!=0
Out[327]:
array([[False,  True,  True,  True,  True,  True],
       [ True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True, False,  True]], dtype=bool)

In [329]: conv2(a!=0,[[1]*3])
Out[329]:
array([[0, 1, 2, 3, 3, 3, 2, 1],
       [1, 2, 3, 2, 1, 0, 0, 0],
       [1, 2, 3, 3, 3, 3, 2, 1],
       [1, 2, 3, 3, 2, 2, 1, 1]])

In [330]: conv2(a!=0,[[1]*3])==3
Out[330]:
array([[False, False, False,  True,  True,  True, False, False],
       [False, False,  True, False, False, False, False, False],
       [False, False,  True,  True,  True,  True, False, False],
       [False, False,  True,  True, False, False, False, False]], dtype=bool)

In [331]: (conv2(a!=0,[[1]*3])==3).sum(1)
Out[331]: array([3, 1, 4, 2])
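The counts above can be cross-checked without SciPy. The following is a minimal sketch, assuming NumPy 1.20 or later for sliding_window_view; the helper name count_full_windows is hypothetical and not part of the original answer. It builds every window explicitly and counts those whose entries are all non-zero.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def count_full_windows(a, size):
    """Per row, count the window positions of `size` whose entries are all non-zero.

    Hypothetical helper for illustration; it counts overlapping windows.
    """
    mask = np.asarray(a) != 0
    # windows has shape (rows, cols - size + 1, size)
    windows = sliding_window_view(mask, size, axis=1)
    return windows.all(axis=2).sum(axis=1)

a = np.array([[0, 2, 1, 2, 1, 2],
              [2, 2, 2, 0, 0, 0],
              [2, 2, 1, 1, 1, 1],
              [1, 2, 1, 2, 0, 1]])

print(count_full_windows(a, 3))  # [3 1 4 2]
```

This gives the same per-row counts as the convolution, since a window sum equals the window size exactly when every entry in the window is non-zero.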
Sample run:
In [158]: df_out
Out[158]:
   a1  a2  a3  a4  a5  a6  2s  3s  4s  5s  6s
0   1   2   1   0   0   2   2   1   0   0   0
1   1   1   2   1   0   1   3   2   1   0   0
2   1   1   0   0   1   1   2   0   0   0   0
3   2   2   1   0   2   2   3   1   0   0   0
Please note that if the first column is 'id', we need to skip it: use df.values[:,1:] instead of df.values in the proposed solution code.
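As an illustration of skipping the 'id' column, here is a hedged sketch on the DataFrame from the question. To keep it NumPy-only, it swaps convolve2d for a 1-D np.convolve applied to each row, which produces the same full-mode sliding sums; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a1": [3, 0, 20, 0],
    "a2": [0, 0, 11, 3],
    "a3": [10, 31, 6, 1],
    "a4": [25, 15, 5, 7],
})

# Leave the 'id' column out of the computation
mask = (df.drop(columns="id").to_numpy() != 0).astype(int)

window_counts = {}
for s in [2, 3, 4]:
    # Full 1-D convolution of each row with a length-s kernel of ones
    sums = np.apply_along_axis(np.convolve, 1, mask, np.ones(s, dtype=int))
    # A sum equal to s means every entry in that window is non-zero
    window_counts[f"{s}s"] = (sums == s).sum(axis=1)

# Reuse df's index so the new columns line up with the original rows
df_out = pd.concat([df, pd.DataFrame(window_counts, index=df.index)], axis=1)
print(df_out)
```

On this data the row with id 3 gets 3, 2 and 1 for the 2s, 3s and 4s columns, because each window position is counted even when windows overlap. The question's desired output for that row is 2, 1 and 1.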
A solution that handles the non-overlapping requirement:
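import numpy as np

def count(row, mins):
    # Encode the non-zero mask as a string of chr(1)/chr(0), then split on chr(0) to get the runs
    runs = (row != 0).astype(np.uint8).tobytes().decode().split(chr(0))
    lengths = [len(run) for run in runs]
    # A run of length L holds L // m non-overlapping runs of length m
    return np.floor_divide.outer(lengths, mins).sum(0)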
It uses fast string operations to find all the runs, then uses // to find how many non-overlapping runs of a given length fit in each.
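The same two steps (find the run lengths, then floor-divide) can also be sketched without the string round trip. This is an alternative under stated assumptions, not the answer's method: run_lengths and count_runs are hypothetical names, and the run boundaries are located with np.diff on a zero-padded mask.

```python
import numpy as np

def run_lengths(row):
    """Lengths of the maximal runs of non-zero values in a 1-D sequence."""
    # Pad with zeros so runs touching either end still have a start and an end
    mask = np.concatenate(([0], (np.asarray(row) != 0).astype(np.int8), [0]))
    edges = np.diff(mask)
    starts = np.flatnonzero(edges == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(edges == -1)    # 1 -> 0 transitions
    return ends - starts

def count_runs(row, sizes):
    """For each size m, sum L // m over all run lengths L."""
    return np.floor_divide.outer(run_lengths(row), sizes).sum(axis=0)

print(run_lengths([3, 0, 10, 25]))             # [1 2]
print(count_runs([3, 0, 10, 25], [2, 3, 4]))   # [1 0 0]
print(count_runs([20, 11, 6, 5], [2, 3, 4]))   # [2 1 1]
```

A single run of length 4 contributes 4 // 2 = 2 runs of length 2, 4 // 3 = 1 run of length 3 and 4 // 4 = 1 run of length 4, which is where the 2, 1, 1 comes from.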
With df:
    a1  a2  a3  a4
id
1    3   0  10  25
2    0   0  31  15
3   20  11   6   5
4    0   3   1   7
np.apply_along_axis(count,1,df,[2,3,4])
returns
array([[1, 0, 0],
       [1, 0, 0],
       [2, 1, 1],
       [1, 1, 0]], dtype=int32)
which is the expected result for df.
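One way to sanity-check these numbers is a plain left-to-right scan that resets its streak each time a run of n is completed. The sketch below is only a slow cross-check under that reading of "non-overlapping"; greedy_count is a hypothetical helper, and the DataFrame is rebuilt from the question's data with id as the index.

```python
import pandas as pd

def greedy_count(values, n):
    """Count non-overlapping runs of n consecutive non-zero values, scanning left to right."""
    count = streak = 0
    for v in values:
        streak = streak + 1 if v != 0 else 0
        if streak == n:
            count += 1
            streak = 0  # start afresh so counted runs do not overlap
    return count

df = pd.DataFrame(
    {"a1": [3, 0, 20, 0], "a2": [0, 0, 11, 3], "a3": [10, 31, 6, 1], "a4": [25, 15, 5, 7]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

value_cols = ["a1", "a2", "a3", "a4"]
for n in (2, 3, 4):
    # Append one result column per run length, named like the question's output
    df[f"{n}s"] = [greedy_count(row, n) for row in df[value_cols].to_numpy()]

print(df)
```

The appended 2s, 3s and 4s columns match the desired output table from the question.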