Given the DataFrame generated by:
import numpy as np
import pandas as pd
from datetime import timedelta
np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=14, freq='9H')
ids = [1]*5 + [2]*2 + [3]*7
df = pd.DataFrame({'id': ids, 'time_entered': rng, 'val': np.random.randn(len(rng))})
df
id time_entered val
0 1 2015-02-24 00:00:00 1.764052
1 1 2015-02-24 09:00:00 0.400157
2 1 2015-02-24 18:00:00 0.978738
3 1 2015-02-25 03:00:00 2.240893
4 1 2015-02-25 12:00:00 1.867558
5 2 2015-02-25 21:00:00 -0.977278
6 2 2015-02-26 06:00:00 0.950088
7 3 2015-02-26 15:00:00 -0.151357
8 3 2015-02-27 00:00:00 -0.103219
9 3 2015-02-27 09:00:00 0.410599
10 3 2015-02-27 18:00:00 0.144044
11 3 2015-02-28 03:00:00 1.454274
12 3 2015-02-28 12:00:00 0.761038
13 3 2015-02-28 21:00:00 0.121675
For each id, I need to remove rows whose time_entered is more than 24 hours (1 day) before the latest time_entered for that id. My current solution:
def custom_transform(x):
    datetime_from = x["time_entered"].max() - timedelta(days=1)
    return x[x["time_entered"] > datetime_from]

df.groupby("id").apply(custom_transform).reset_index(drop=True)
which gives the correct, expected output:
id time_entered val
0 1 2015-02-24 18:00:00 0.978738
1 1 2015-02-25 03:00:00 2.240893
2 1 2015-02-25 12:00:00 1.867558
3 2 2015-02-25 21:00:00 -0.977278
4 2 2015-02-26 06:00:00 0.950088
5 3 2015-02-28 03:00:00 1.454274
6 3 2015-02-28 12:00:00 0.761038
7 3 2015-02-28 21:00:00 0.121675
However, my real data has tens of millions of rows and hundreds of thousands of unique ids, so this solution is infeasible (it takes a very long time).
Is there a more efficient way to filter the data? I appreciate all ideas!
Generally, avoid groupby().apply(), since it's not vectorized across groups, not to mention the overhead of memory allocation if you are returning new DataFrames, as in your case.
Instead, find the per-id time threshold with groupby().transform, then use boolean indexing on the whole DataFrame:
time_max_by_id = df.groupby('id')['time_entered'].transform('max') - pd.Timedelta('1D')
df[df['time_entered'] > time_max_by_id]
Output:
id time_entered val
2 1 2015-02-24 18:00:00 0.978738
3 1 2015-02-25 03:00:00 2.240893
4 1 2015-02-25 12:00:00 1.867558
5 2 2015-02-25 21:00:00 -0.977278
6 2 2015-02-26 06:00:00 0.950088
11 3 2015-02-28 03:00:00 1.454274
12 3 2015-02-28 12:00:00 0.761038
13 3 2015-02-28 21:00:00 0.121675
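An equivalent vectorized variant, in case you prefer computing the per-id maxima once and mapping them back onto the rows (a sketch of the same idea, not benchmarked against transform):

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question
np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=14, freq='9H')
ids = [1]*5 + [2]*2 + [3]*7
df = pd.DataFrame({'id': ids, 'time_entered': rng, 'val': np.random.randn(len(rng))})

# Compute one max per id, map it back to every row of that id,
# then filter the whole frame in a single boolean-indexing pass
cutoff = df['id'].map(df.groupby('id')['time_entered'].max()) - pd.Timedelta('1D')
out = df[df['time_entered'] > cutoff]
```

This selects the same rows as the transform version; both avoid the per-group Python-level function calls that make groupby().apply() slow on hundreds of thousands of groups.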