
How to get the start and end indexes of a subgroup in a DataFrame

Tags: python, pandas

import pandas as pd
df = pd.DataFrame({"C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'], "C2": ['A','B','A','A','A','A','A','A','B','A']})

    C1      C2
0   USA     A
1   USA     B
2   USA     A
3   USA     A
4   USA     A
5   JAPAN   A
6   JAPAN   A
7   JAPAN   A
8   USA     B
9   USA     A

This is a watered-down version of my problem. My objective is to iterate over each subgroup of the dataframe where C2 contains a B. If a B is in C2, I look at C1 and need the entire group of consecutive rows with that C1 value. In this example, the B at index 1 belongs to a USA group that starts at index 0 and finishes at 4. Another such group runs from 8 to 9.

So my desired result would be the indexes such that:

[[0,4],[8,9]] 

I tried to use groupby, but it doesn't work because it groups all the USA rows together.

my_index = list(df[df['C2']=='B'].index)
my_index

would give 1 and 8, but how do I get the start and finish of each group?

ProcolHarum asked Apr 18 '21 16:04



2 Answers

Here is one approach: first build a helper series that labels each run of consecutive C1 values, use it to mask the dataframe to the groups that have at least one B, then grab the index and split it at the gaps to get the first and last index values of each group:

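import numpy as np

s = df['C1'].ne(df['C1'].shift()).cumsum()   # label each run of equal C1 values
i = df.index[s.isin(s[df['C2'].eq("B")])]    # index of rows in runs that contain a B
p = np.where(np.diff(i)>1)[0]+1              # positions where the index jumps
split_ = np.split(i,p)
out = [[g[0],g[-1]] for g in split_]         # first and last index of each piece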

print(out)
[[0, 4], [8, 9]]
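
For readers who want to inspect the intermediate values, here is a self-contained sketch of the same steps on the question's sample data. The names `block_id`, `keep` and `breaks` are illustrative and not part of the answer, and the conversion to plain Python ints is an addition so the printed result looks the same across NumPy versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "C1": ["USA"] * 5 + ["JAPAN"] * 3 + ["USA"] * 2,
    "C2": ["A", "B", "A", "A", "A", "A", "A", "A", "B", "A"],
})

# Label each run of consecutive equal C1 values: 1,1,1,1,1,2,2,2,3,3
block_id = df["C1"].ne(df["C1"].shift()).cumsum()

# Index labels of the rows whose run contains at least one "B"
keep = df.index[block_id.isin(block_id[df["C2"].eq("B")])].to_numpy()

# Positions where the kept labels stop being consecutive
breaks = np.where(np.diff(keep) > 1)[0] + 1

# First and last label of every consecutive piece, as plain Python ints
out = [[int(g[0]), int(g[-1])] for g in np.split(keep, breaks)]
print(out)  # [[0, 4], [8, 9]]
```

Splitting on gaps assumes a default integer index, and it would merge two selected runs that sit directly next to each other (for example a USA run with a B followed immediately by a JAPAN run with a B), because there is no jump in the index between them.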
anky answered Sep 27 '22 19:09


Solution

b = df['C1'].ne(df['C1'].shift()).cumsum()
m = b.isin(b[df['C2'].eq('B')])
i = m.index[m].to_series().groupby(b).agg(['first', 'last']).values.squeeze()

Explanations

Shift column C1 and compare the shifted column with the non-shifted one to create a boolean mask, then take a cumulative sum of this mask to identify the blocks of rows where the value in column C1 stays the same.

>>> b

0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    3
9    3
Name: C1, dtype: int64
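
The shift-and-compare step can be broken into two parts to see why the cumulative sum produces block numbers. This is only a sketch on the question's C1 column; the name `starts` is made up for illustration.

```python
import pandas as pd

c1 = pd.Series(["USA"] * 5 + ["JAPAN"] * 3 + ["USA"] * 2, name="C1")

# True wherever the value differs from the previous row. The first row is
# compared with a missing value, so it always starts a new block.
starts = c1.ne(c1.shift())

# Each True adds 1 to the running total, so every block gets its own number
b = starts.cumsum()

print(starts.tolist())
print(b.tolist())
```

Because the counter only increases when the value changes, the second USA block gets a different number (3) from the first one (1), which is exactly what a plain groupby on C1 cannot do.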

Create a boolean mask m to identify the blocks of rows that contain at least one B.

>>> m

0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
8     True
9     True
Name: C1, dtype: bool
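
As a sketch of how the mask is assembled, the step can be split in two: first collect the block numbers of the rows holding a B, then flag every row that belongs to one of those blocks. The name `blocks_with_b` is an assumption introduced here for readability.

```python
import pandas as pd

df = pd.DataFrame({
    "C1": ["USA"] * 5 + ["JAPAN"] * 3 + ["USA"] * 2,
    "C2": ["A", "B", "A", "A", "A", "A", "A", "A", "B", "A"],
})
b = df["C1"].ne(df["C1"].shift()).cumsum()

# Block numbers of the rows that hold a "B"
blocks_with_b = b[df["C2"].eq("B")].unique()

# Flag every row whose block number is one of those
m = b.isin(blocks_with_b)

print(blocks_with_b.tolist())
print(df[m])
```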

Filter the index by using boolean masking with mask m, then group the filtered index by the identified blocks b and aggregate using first and last to get the indices.

>>> i

array([[0, 4],
       [8, 9]])
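
Since the stated goal is to iterate over each subgroup, the steps can be wrapped in a small helper. This is a sketch, not part of the answer: `blocks_containing` and its parameters are hypothetical names, and it returns a list of lists via `to_numpy().tolist()` instead of calling `squeeze()`, so a single matching block still comes back as `[[first, last]]`.

```python
import pandas as pd

def blocks_containing(df, key, col, value):
    """Return [first, last] index labels of each run of equal `key` values
    that contains `value` in `col` (hypothetical helper for illustration)."""
    b = df[key].ne(df[key].shift()).cumsum()   # block number per row
    m = b.isin(b[df[col].eq(value)])           # rows in blocks holding `value`
    idx = df.index[m].to_series()
    # Grouping by block number keeps neighbouring blocks separate
    bounds = idx.groupby(b[m]).agg(["first", "last"])
    return bounds.to_numpy().tolist()

df = pd.DataFrame({
    "C1": ["USA"] * 5 + ["JAPAN"] * 3 + ["USA"] * 2,
    "C2": ["A", "B", "A", "A", "A", "A", "A", "A", "B", "A"],
})

ranges = blocks_containing(df, "C1", "C2", "B")
print(ranges)  # [[0, 4], [8, 9]]

for first, last in ranges:
    sub = df.loc[first:last]   # .loc slicing includes both end labels
    print(sub)
```

With only one matching block, for example `df.iloc[:5]`, `squeeze()` would collapse the result to a one-dimensional array, whereas this version keeps the nested shape.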
Shubham Sharma answered Sep 27 '22 19:09