I have the following pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})
>>> df
first_column
0 0
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 1
9 1
10 0
11 0
12 0
13 0
14 1
15 1
16 1
17 1
18 1
19 0
20 0
first_column is a binary column of 0s and 1s. There are "clusters" of consecutive ones, and each cluster always contains at least two rows.
My goal is to create a column which "counts" the number of rows of ones per group:
>>> df
first_column counts
0 0 0
1 0 0
2 0 0
3 1 3
4 1 3
5 1 3
6 0 0
7 0 0
8 1 2
9 1 2
10 0 0
11 0 0
12 0 0
13 0 0
14 1 5
15 1 5
16 1 5
17 1 5
18 1 5
19 0 0
20 0 0
This sounds like a job for df.loc, e.g. df.loc[df.first_column == 1] followed by ...something.
I'm just not sure how to take into account each individual "cluster" of ones, and how to label each of the unique clusters with the "row count".
How would one do this?
Here's one approach with NumPy's cumsum and bincount -
def cumsum_bincount(a):
    # Prepend 0 & look for a [0,1] pattern. Form a binned array based off 1s groups
    ids = a*(np.diff(np.r_[0,a])==1).cumsum()
    # Get the bincount, index into the count with ids and finally mask out 0s
    return a*np.bincount(ids)[ids]
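To make the two steps in the function above easier to follow, here is a small sketch that prints each intermediate array. The short input array and the names starts, sizes and counts are made up for illustration; they are not part of the original function.

```python
import numpy as np

# Illustrative input (an assumption, shorter than the DataFrame column).
a = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0])

# Prepend 0 so that a run starting at index 0 still shows a 0 -> 1 step.
starts = np.diff(np.r_[0, a]) == 1
# The running count of starts tags every position with the latest run seen;
# multiplying by a sends the positions that hold 0 back to id 0.
ids = a * starts.cumsum()
# sizes[k] is the number of positions carrying id k (sizes[0] counts the 0s).
sizes = np.bincount(ids)
# Look up each position's run size, then mask out the 0 positions.
counts = a * sizes[ids]

print(starts.astype(int))  # [0 0 0 1 0 0 0 0 1 0 0]
print(ids)                 # [0 0 0 1 1 1 0 0 2 2 0]
print(sizes)               # [6 3 2]
print(counts)              # [0 0 0 3 3 3 0 0 2 2 0]
```

The final multiplication by a is what keeps the count of zeros (sizes[0]) from leaking into the result.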
Sample run -
In [88]: df['counts'] = cumsum_bincount(df.first_column.values)
In [89]: df
Out[89]:
first_column counts
0 0 0
1 0 0
2 0 0
3 1 3
4 1 3
5 1 3
6 0 0
7 0 0
8 1 2
9 1 2
10 0 0
11 0 0
12 0 0
13 0 0
14 1 5
15 1 5
16 1 5
17 1 5
18 1 5
19 0 0
20 0 0
Set the leading elems to 1s, so that the first 6 elems are all 1s, and then test out -
In [101]: df.first_column.values[:5] = 1
In [102]: df['counts'] = cumsum_bincount(df.first_column.values)
In [103]: df
Out[103]:
first_column counts
0 1 6
1 1 6
2 1 6
3 1 6
4 1 6
5 1 6
6 0 0
7 0 0
8 1 2
9 1 2
10 0 0
11 0 0
12 0 0
13 0 0
14 1 5
15 1 5
16 1 5
17 1 5
18 1 5
19 0 0
20 0 0
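For comparison, the same result can probably also be reached without dropping down to NumPy, by labelling runs with shift and cumsum and then summing within each run. This is a different technique from the cumsum/bincount function above, offered only as a sketch; the names s and run_id are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
                                    0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})

s = df["first_column"]
# A new run starts wherever the value differs from the previous row,
# so the cumulative sum of those change points labels each run.
run_id = (s != s.shift()).cumsum()
# Summing a run of 1s gives its length; summing a run of 0s gives 0.
df["counts"] = s.groupby(run_id).transform("sum")

print(df)
```

Because the column holds only 0s and 1s, the per-run sum doubles as the run length, so no separate masking step is needed.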