
Python: remove rows with max value in each hour

Tags: python, pandas

I have a pandas DataFrame df like this:

date                id      eng  math sci
2021-08-01 00:00:37 23      4.0  5.0  7.0
2021-08-01 00:05:37 23      4.0  4.0  5.0
2021-08-01 00:10:37 23      4.0  4.0  6.0
2021-08-01 00:15:38 23      4.0  4.0  5.0
2021-08-01 00:20:37 23      4.0  5.0  6.0
2021-08-01 00:25:37 23      4.0  5.0  7.0
... ... ... ... ...
2021-08-31 23:38:40 1995    4.0  4.0  5.0
2021-08-31 23:43:40 1995    4.0  4.0  4.0
2021-08-31 23:48:40 1995    4.0  5.0  5.0
2021-08-31 23:53:40 1995    4.0  4.0  4.0
2021-08-31 23:58:40 1995    4.0  5.0  7.0
1661089 rows × 4 columns

I want to remove the row with the maximum sci value in each hour, and exactly one row per hour. If two rows share the maximum within the same hour, as in the first hour above (7.0 at 00:00:37 and at 00:25:37), only the first of them should be removed.

So the result should look like:

date                id      eng  math sci
2021-08-01 00:05:37 23      4.0  4.0  5.0
2021-08-01 00:10:37 23      4.0  4.0  6.0
2021-08-01 00:15:38 23      4.0  4.0  5.0
2021-08-01 00:20:37 23      4.0  5.0  6.0
2021-08-01 00:25:37 23      4.0  5.0  7.0
... ... ... ... ...
2021-08-31 23:38:40 1995    4.0  4.0  5.0
2021-08-31 23:43:40 1995    4.0  4.0  4.0
2021-08-31 23:48:40 1995    4.0  5.0  5.0
2021-08-31 23:53:40 1995    4.0  4.0  4.0

My first attempt:

df_filtered = df.reset_index()
df_temp_max = (df_filtered.groupby(['id', pd.Grouper(key='date', freq='1H')])
            .agg({'sci': 'max'})
            .reset_index())
df_test_max = pd.Series(df_temp_max['sci'].values)    
df_filtered.insert(5, 'sci_max', df_test_max, True)

I got:

        date                id      eng math  sci  sci_max
0       2021-08-01 00:00:37 23      4.0  5.0  7.0  7.0
1       2021-08-01 00:05:37 23      4.0  4.0  5.0  7.0
2       2021-08-01 00:10:37 23      4.0  4.0  6.0  7.0
3       2021-08-01 00:15:38 23      4.0  4.0  5.0  7.0
4       2021-08-01 00:20:37 23      4.0  5.0  6.0  7.0
... ... ... ... ... ... ...
1661084 2021-08-31 23:38:40 1995    4.0  4.0  5.0  NaN
1661085 2021-08-31 23:43:40 1995    4.0  4.0  4.0  NaN
1661086 2021-08-31 23:48:40 1995    4.0  5.0  5.0  NaN
1661087 2021-08-31 23:53:40 1995    4.0  4.0  4.0  NaN
1661088 2021-08-31 23:58:40 1995    4.0  5.0  7.0  NaN

This is clearly wrong: most of the sci_max values are NaN.

I also tried a for loop, but it was far too slow, and removing a row inside the loop caused an indexing error.

Could you help me solve this problem, please? Thank you so much!

dev_groot asked May 05 '26 07:05



1 Answer

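Use idxmax instead of max to get the index label of the row to remove in each group. When the maximum is tied, idxmax returns the first occurrence, so only the first of the tied rows is dropped: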

idx = df.groupby(['id', pd.Grouper(key='date', freq='H')])['sci'].idxmax()
out = df.drop(idx)

Output:

>>> idx
id    date               
23    2021-08-01 00:00:00     0
1995  2021-08-31 23:00:00    10
Name: sci, dtype: int64

>>> out
                 date    id  eng  math  sci
1 2021-08-01 00:05:37    23  4.0   4.0  5.0
2 2021-08-01 00:10:37    23  4.0   4.0  6.0
3 2021-08-01 00:15:38    23  4.0   4.0  5.0
4 2021-08-01 00:20:37    23  4.0   5.0  6.0
5 2021-08-01 00:25:37    23  4.0   5.0  7.0
6 2021-08-31 23:38:40  1995  4.0   4.0  5.0
7 2021-08-31 23:43:40  1995  4.0   4.0  4.0
8 2021-08-31 23:48:40  1995  4.0   5.0  5.0
9 2021-08-31 23:53:40  1995  4.0   4.0  4.0
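
For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same approach. It assumes only the eleven rows visible in the question, a date column rather than a date index, and a unique default integer index. It uses freq='h' because recent pandas releases warn about the uppercase 'H' alias; on older versions 'H' as written above should behave the same.

```python
import pandas as pd

# Sample rows copied from the question (first and last hour only).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-08-01 00:00:37", "2021-08-01 00:05:37", "2021-08-01 00:10:37",
        "2021-08-01 00:15:38", "2021-08-01 00:20:37", "2021-08-01 00:25:37",
        "2021-08-31 23:38:40", "2021-08-31 23:43:40", "2021-08-31 23:48:40",
        "2021-08-31 23:53:40", "2021-08-31 23:58:40",
    ]),
    "id": [23] * 6 + [1995] * 5,
    "eng": [4.0] * 11,
    "math": [5.0, 4.0, 4.0, 4.0, 5.0, 5.0, 4.0, 4.0, 5.0, 4.0, 5.0],
    "sci": [7.0, 5.0, 6.0, 5.0, 6.0, 7.0, 5.0, 4.0, 5.0, 4.0, 7.0],
})

# Assumption: index labels are unique. drop() removes every row that
# carries a given label, so duplicates would remove more than one row.
assert df.index.is_unique

# One index label per (id, hour) group: the first row holding the max sci.
idx = df.groupby(["id", pd.Grouper(key="date", freq="h")])["sci"].idxmax()

# Drop exactly those rows; the second 7.0 in the first hour (label 5) stays.
out = df.drop(idx)

print(idx)
print(out)
```

If the index is not unique, calling df.reset_index(drop=True) before the groupby should make the labels safe to drop by.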
Corralien answered May 06 '26 20:05



