Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split dataframe by two repeated values

I have a dataframe which describes the status of a person:

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3], 
                  'B': [6, 7, 8, 9, 10, 23, 11, 12, 13], 
                  'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})

This dataframe records two trips of the person. I want to split it based on the values of column C, 'start' and 'end'. The other values in column C do not matter.

I could divide the dataframe by the following codes:

x=[]
y=[]

for i in range(len(df)):
    if df['C'][i]=='start':
        x.append(i)
    elif df['C'][i]=='end':
        y.append(i)

for i, j in zip(x, y):
    new_df = df.iloc[i:j+1,:]
    print(new_df)

However, I'm wondering is there any more efficient way to divide it without loop since I have a pretty large dataframe.

like image 374
Pei Li Avatar asked Mar 29 '26 01:03

Pei Li


2 Answers

I would create a dict using GroupBy.__iter__()

Method 1

start = df['C'].eq('start')
dfs = dict(df.loc[(start.add(df['C'].shift().eq('end')).cumsum()%2).eq(1)]
             .groupby(start.cumsum())
             .__iter__())

#{1:    A  B        C
# 0  1  6    start
# 1  2  7  running
# 2  3  8  running
# 3  4  9      end, 2:    A   B        C
# 5  6  23    start
# 6  7  11  running
# 7  8  12  resting
# 8  3  13      end}

Method 2

start = df['C'].eq('start')
dfs = dict(df.loc[start.where(start)
                       .groupby(df['C'].shift()
                                       .eq('end')
                                       .cumsum())
                       .ffill().notna()]
             .groupby(start.cumsum())
             .__iter__())

#{1:    A  B        C
# 0  1  6    start
# 1  2  7  running
# 2  3  8  running
# 3  4  9      end, 2:    A   B        C
# 5  6  23    start
# 6  7  11  running
# 7  8  12  resting
# 8  3  13      end}

Accessing DataFrame

print(dfs[1])
   A  B        C
0  1  6    start
1  2  7  running
2  3  8  running
3  4  9      end

print(dfs[2])
   A   B        C
5  6  23    start
6  7  11  running
7  8  12  resting
8  3  13      end

We can use groupby.get_group

dfs = (df.loc[start.where(start)
                   .groupby(df['C'].shift()
                                   .eq('end')
                                   .cumsum())
                       .ffill().notna()]
          .groupby(start.cumsum()))
df1=dfs.get_group(1)
df2=dfs.get_group(2) 
print(df1)
print(df2)

Details Method 2

start.where(start)
0    1.0
1    NaN
2    NaN
3    NaN
4    NaN
5    1.0
6    NaN
7    NaN
8    NaN
Name: C, dtype: float64

df['C'].shift().eq('end').cumsum()


0    0
1    0
2    0
3    0
4    1
5    1
6    1
7    1
8    1
Name: C, dtype: int64

as you can see row 4 is within group 1, and when using groupby.ffill its value remains NaN

like image 74
ansev Avatar answered Mar 31 '26 07:03

ansev


Based on the comments, the starting dataframe:

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
                  'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
                  'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})

Then:

for g in df.groupby(df.assign(tmp=(df['C'] == 'start'))['tmp'].cumsum()):
    m = (g[1]['C'] == 'end').shift().fillna(False).cumsum() == 0
    print(g[1][m])

Prints:

   A  B        C
0  1  6    start
1  2  7  running
2  3  8  running
3  4  9      end
   A   B        C
5  6  23    start
6  7  11  running
7  8  12  resting
8  3  13      end
like image 40
Andrej Kesely Avatar answered Mar 31 '26 07:03

Andrej Kesely



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!