Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way of filtering by datetime in groupby

Given the DataFrame generated by:

import numpy as np
import pandas as pd
from datetime import timedelta

np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=14, freq='9H')
ids = [1]*5 + [2]*2 + [3]*7
df = pd.DataFrame({'id': ids, 'time_entered': rng, 'val': np.random.randn(len(rng))})

df:

    id  time_entered        val
0   1   2015-02-24 00:00:00 1.764052
1   1   2015-02-24 09:00:00 0.400157
2   1   2015-02-24 18:00:00 0.978738
3   1   2015-02-25 03:00:00 2.240893
4   1   2015-02-25 12:00:00 1.867558
5   2   2015-02-25 21:00:00 -0.977278
6   2   2015-02-26 06:00:00 0.950088
7   3   2015-02-26 15:00:00 -0.151357
8   3   2015-02-27 00:00:00 -0.103219
9   3   2015-02-27 09:00:00 0.410599
10  3   2015-02-27 18:00:00 0.144044
11  3   2015-02-28 03:00:00 1.454274
12  3   2015-02-28 12:00:00 0.761038
13  3   2015-02-28 21:00:00 0.121675

I need to, for each id, remove rows which are more than 24hours (1 day) from the latest time_entered, for that id. My current solution:

def custom_transform(x):
    datetime_from = x["time_entered"].max() - timedelta(days=1)
    x = x[x["time_entered"] > datetime_from]
    return x

df.groupby("id").apply(lambda x: custom_transform(x)).reset_index(drop=True)

which gives the correct, expected, output:

    id  time_entered        val
0   1   2015-02-24 18:00:00 0.978738
1   1   2015-02-25 03:00:00 2.240893
2   1   2015-02-25 12:00:00 1.867558
3   2   2015-02-25 21:00:00 -0.977278
4   2   2015-02-26 06:00:00 0.950088
5   3   2015-02-28 03:00:00 1.454274
6   3   2015-02-28 12:00:00 0.761038
7   3   2015-02-28 21:00:00 0.121675

However, my real data is tens of millions of rows, and hundreds of thousands of unique ids, because of this, this solution is infeasible (takes very long time).

Is there a more efficient way to filter the data? I appreciate all ideas!

like image 955
Marcus Avatar asked Oct 22 '20 13:10

Marcus


People also ask

How do I filter by datetime?

To filter rows based on dates, first format the dates in the DataFrame to datetime64 type. Then use the DataFrame. loc[] and DataFrame. query[] function from the Pandas package to specify a filter condition.

Is pandas GROUP BY efficient?

Groupby is a very popular function in Pandas. This is very good at summarising, transforming, filtering, and a few other very essential data analysis tasks.

What is possible using GROUP BY () method of pandas?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.


1 Answers

Generally, avoid groupby().apply() since it's not vectorized across groups, not to mention the overhead for memory allocation if you are returning new dataframes as in your case.

How about finding the time threshold with groupby().transform then use boolean indexing on the whole data:

time_max_by_id = df.groupby('id')['time_entered'].transform('max') - pd.Timedelta('1D')
df[df['time_entered'] > time_max_by_id]

Output:

    id        time_entered       val
2    1 2015-02-24 18:00:00  0.978738
3    1 2015-02-25 03:00:00  2.240893
4    1 2015-02-25 12:00:00  1.867558
5    2 2015-02-25 21:00:00 -0.977278
6    2 2015-02-26 06:00:00  0.950088
11   3 2015-02-28 03:00:00  1.454274
12   3 2015-02-28 12:00:00  0.761038
13   3 2015-02-28 21:00:00  0.121675
like image 95
Quang Hoang Avatar answered Oct 01 '22 02:10

Quang Hoang