I have a dataframe which describes the status of a person:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
This dataframe records two trips of the person. I want to split it based on the values of column C, 'start' and 'end'. The other values in column C do not matter.
I could divide the dataframe by the following codes:
x=[]
y=[]
for i in range(len(df)):
if df['C'][i]=='start':
x.append(i)
elif df['C'][i]=='end':
y.append(i)
for i, j in zip(x, y):
new_df = df.iloc[i:j+1,:]
print(new_df)
However, I'm wondering is there any more efficient way to divide it without loop since I have a pretty large dataframe.
I would create a dict using GroupBy.__iter__()
Method 1
start = df['C'].eq('start')
dfs = dict(df.loc[(start.add(df['C'].shift().eq('end')).cumsum()%2).eq(1)]
.groupby(start.cumsum())
.__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
Method 2
start = df['C'].eq('start')
dfs = dict(df.loc[start.where(start)
.groupby(df['C'].shift()
.eq('end')
.cumsum())
.ffill().notna()]
.groupby(start.cumsum())
.__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
Accessing DataFrame
print(dfs[1])
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
print(dfs[2])
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
We can use groupby.get_group
dfs = (df.loc[start.where(start)
.groupby(df['C'].shift()
.eq('end')
.cumsum())
.ffill().notna()]
.groupby(start.cumsum()))
df1=dfs.get_group(1)
df2=dfs.get_group(2)
print(df1)
print(df2)
Details Method 2
start.where(start)
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Name: C, dtype: float64
df['C'].shift().eq('end').cumsum()
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 1
Name: C, dtype: int64
as you can see row 4 is within group 1, and when using groupby.ffill its value remains NaN
Based on the comments, the starting dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
Then:
for g in df.groupby(df.assign(tmp=(df['C'] == 'start'))['tmp'].cumsum()):
m = (g[1]['C'] == 'end').shift().fillna(False).cumsum() == 0
print(g[1][m])
Prints:
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With