df=pd.DataFrame({"C1":['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],'C2':['A','B','A','A','A','A','A','A','B','A']})
C1 C2
0 USA A
1 USA B
2 USA A
3 USA A
4 USA A
5 JAPAN A
6 JAPAN A
7 JAPAN A
8 USA B
9 USA A
This is a watered-down version of my problem, to keep it simple. My objective is to iterate over the subgroups of the dataframe where C2 has a B in it. If a B is in C2, I look at C1 and need the entire group. In this example, I see USA, and the group starts at index 0 and finishes at 4. Another one is between 8 and 9.
So my desired result would be the indexes such that:
[[0,4],[8,9]]
I tried to use groupby, but it wouldn't work because it groups all the USA rows together.
my_index = list(df[df['C2']=='B'].index)
my_index
would give 1 and 8, but how do I get the start/finish?
Here is one approach: first mask the dataframe on the groups that have at least one B, then grab the index and split it into consecutive runs to get the first and last index values of each run:
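import numpy as np
# Label each run of consecutive identical C1 values with its own block id
s = df['C1'].ne(df['C1'].shift()).cumsum()
# Index labels of all rows whose block contains at least one "B"
i = df.index[s.isin(s[df['C2'].eq("B")])]
# Positions where the kept index labels stop being consecutive
p = np.where(np.diff(i)>1)[0]+1
split_ = np.split(i,p)
out = [[int(g[0]), int(g[-1])] for g in split_]
print(out)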
[[0, 4], [8, 9]]
b = df['C1'].ne(df['C1'].shift()).cumsum()
m = b.isin(b[df['C2'].eq('B')])
i = m.index[m].to_series().groupby(b).agg(['first', 'last']).values.squeeze()
Shift column C1 and compare the shifted column with the non-shifted one to create a boolean mask, then take a cumulative sum of this mask to identify the blocks of rows where the value in column C1 stays the same.
>>> b
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
Name: C1, dtype: int64
Create a boolean mask m to identify the blocks of rows that contain at least one B.
>>> m
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 True
9 True
Name: C1, dtype: bool
Filter the index by boolean masking with mask m, then group the filtered index by the identified blocks b and aggregate using first and last to get the indices.
>>> i
array([[0, 4],
[8, 9]])
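For reference, the block-id, mask, and first/last steps described above can be put together in one self-contained sketch. The function name block_ranges and its parameters are hypothetical, introduced here only for illustration; the sample data is the frame from the question. It returns plain Python lists rather than a squeezed array, on the assumption that a list of pairs is the desired shape.

```python
import pandas as pd


def block_ranges(df, group_col="C1", flag_col="C2", flag="B"):
    """Return [first, last] index labels of each consecutive run of
    `group_col` values that contains at least one `flag` in `flag_col`.
    (Hypothetical helper; names are illustrative, not from the answers.)"""
    # A new block starts wherever the value differs from the previous row.
    block = df[group_col].ne(df[group_col].shift()).cumsum()
    # Keep only rows belonging to a block that contains the flag.
    keep = block.isin(block[df[flag_col].eq(flag)])
    labels = df.index[keep].to_series()
    # Group the surviving index labels by block id and take both ends.
    ends = labels.groupby(block[keep]).agg(["first", "last"])
    return ends.to_numpy().tolist()


df = pd.DataFrame({
    "C1": ["USA"] * 5 + ["JAPAN"] * 3 + ["USA"] * 2,
    "C2": ["A", "B", "A", "A", "A", "A", "A", "A", "B", "A"],
})
print(block_ranges(df))  # [[0, 4], [8, 9]]
```

Because the ends are taken per block id rather than by looking for gaps in the index, two neighbouring blocks that both contain a B stay separate, and a single qualifying block still comes back as a list holding one pair.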