I have a pandas DataFrame df like this:
date id eng math sci
2021-08-01 00:00:37 23 4.0 5.0 7.0
2021-08-01 00:05:37 23 4.0 4.0 5.0
2021-08-01 00:10:37 23 4.0 4.0 6.0
2021-08-01 00:15:38 23 4.0 4.0 5.0
2021-08-01 00:20:37 23 4.0 5.0 6.0
2021-08-01 00:25:37 23 4.0 5.0 7.0
... ... ... ... ...
2021-08-31 23:38:40 1995 4.0 4.0 5.0
2021-08-31 23:43:40 1995 4.0 4.0 4.0
2021-08-31 23:48:40 1995 4.0 5.0 5.0
2021-08-31 23:53:40 1995 4.0 4.0 4.0
2021-08-31 23:58:40 1995 4.0 5.0 7.0
1661089 rows × 4 columns
I want to remove the row with the maximum sci value in each hour. Exactly one row should be removed per hour: if the maximum value occurs twice within an hour, as in the case above, only the first of those rows should be removed.
So the result should look like:
date id eng math sci
2021-08-01 00:05:37 23 4.0 4.0 5.0
2021-08-01 00:10:37 23 4.0 4.0 6.0
2021-08-01 00:15:38 23 4.0 4.0 5.0
2021-08-01 00:20:37 23 4.0 5.0 6.0
2021-08-01 00:25:37 23 4.0 5.0 7.0
... ... ... ... ...
2021-08-31 23:38:40 1995 4.0 4.0 5.0
2021-08-31 23:43:40 1995 4.0 4.0 4.0
2021-08-31 23:48:40 1995 4.0 5.0 5.0
2021-08-31 23:53:40 1995 4.0 4.0 4.0
My first attempt:
df_filtered = df.reset_index()
df_temp_max = (df_filtered.groupby(['id', pd.Grouper(key='date', freq='1H')])
               .agg({'sci': 'max'})
               .reset_index())
df_test_max = pd.Series(df_temp_max['sci'].values)
df_filtered.insert(5, 'sci_max', df_test_max, True)
I got:
date id eng math sci sci_max
0 2021-08-01 00:00:37 23 4.0 5.0 7.0 7.0
1 2021-08-01 00:05:37 23 4.0 4.0 5.0 7.0
2 2021-08-01 00:10:37 23 4.0 4.0 6.0 7.0
3 2021-08-01 00:15:38 23 4.0 4.0 5.0 7.0
4 2021-08-01 00:20:37 23 4.0 5.0 6.0 7.0
... ... ... ... ... ... ...
1661084 2021-08-31 23:38:40 1995 4.0 4.0 5.0 NaN
1661085 2021-08-31 23:43:40 1995 4.0 4.0 4.0 NaN
1661086 2021-08-31 23:48:40 1995 4.0 5.0 5.0 NaN
1661087 2021-08-31 23:53:40 1995 4.0 4.0 4.0 NaN
1661088 2021-08-31 23:58:40 1995 4.0 5.0 7.0 NaN
Of course, this is not correct: the sci_max column contains many NaN values.
I also tried using a for loop, but it took too much time, and removing a row inside the loop caused an indexing error as well.
Could you help me solve this problem, please? Thank you so much!
Use idxmax instead of max to get the index to remove per group:
idx = df.groupby(['id', pd.Grouper(key='date', freq='H')])['sci'].idxmax()
out = df.drop(idx)
Output:
>>> idx
id date
23 2021-08-01 00:00:00 0
1995 2021-08-31 23:00:00 10
Name: sci, dtype: int64
>>> out
date id eng math sci
1 2021-08-01 00:05:37 23 4.0 4.0 5.0
2 2021-08-01 00:10:37 23 4.0 4.0 6.0
3 2021-08-01 00:15:38 23 4.0 4.0 5.0
4 2021-08-01 00:20:37 23 4.0 5.0 6.0
5 2021-08-01 00:25:37 23 4.0 5.0 7.0
6 2021-08-31 23:38:40 1995 4.0 4.0 5.0
7 2021-08-31 23:43:40 1995 4.0 4.0 4.0
8 2021-08-31 23:48:40 1995 4.0 5.0 5.0
9 2021-08-31 23:53:40 1995 4.0 4.0 4.0
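For anyone who wants to try this without the full data set, here is a minimal, self-contained sketch of the same idea. The small frame is made up for illustration (it is not the asker's data), and it assumes that date is a regular datetime column and that the index labels are unique, since drop removes rows by label. idxmax returns the label of the first occurrence of the maximum, which is why only one row is removed when the maximum is tied within an hour.

```python
import pandas as pd

# Hypothetical sample data, loosely modelled on the question's frame.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-08-01 00:00:37", "2021-08-01 00:05:37", "2021-08-01 00:25:37",
        "2021-08-01 01:00:37", "2021-08-01 01:05:37",
        "2021-08-31 23:53:40", "2021-08-31 23:58:40",
    ]),
    "id": [23, 23, 23, 23, 23, 1995, 1995],
    "sci": [7.0, 5.0, 7.0, 4.0, 6.0, 4.0, 7.0],
})

# pd.offsets.Hour() is the offset object behind the hourly alias used above;
# passing the object avoids differences in alias spelling between pandas versions.
hourly = pd.Grouper(key="date", freq=pd.offsets.Hour())

# One index label per (id, hour) group: the first row holding the group's maximum.
idx = df.groupby(["id", hourly])["sci"].idxmax()

# Drop exactly those rows; a tied maximum later in the same hour is kept.
out = df.drop(idx)
print(out)
```

In this sample, the first hour for id 23 contains 7.0 twice (rows 0 and 2); only row 0 is dropped, and row 2 stays in the result.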