I am trying to extract the chunks of data with consecutive dates from a Pandas DataFrame. My df
looks like this:
DateAnalyzed Val
1 2018-03-18 0.470253
2 2018-03-19 0.470253
3 2018-03-20 0.470253
4 2018-09-25 0.467729
5 2018-09-26 0.467729
6 2018-09-27 0.467729
In this df
, I want to take the first 3 rows, do some processing, then take the last 3 rows and do processing on those.
I calculated the difference with a lag of 1 by applying the following code.
df['Delta']=(df['DateAnalyzed'] - df['DateAnalyzed'].shift(1))
But after that I can't figure out how to get the groups of consecutive rows without iterating.
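To make the question reproducible, here is a minimal sketch that rebuilds the DataFrame from the table above and computes the same lag-1 difference; the DataFrame construction is an assumption based on the values shown.

```python
import pandas as pd

# Reproducing the question's DataFrame (values copied from the table above).
df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253, 0.470253, 0.470253,
            0.467729, 0.467729, 0.467729],
}, index=range(1, 7))

# Lag-1 difference: a Timedelta of 1 day marks a row consecutive with the
# previous one; NaT (first row) or a larger gap marks the start of a new run.
df["Delta"] = df["DateAnalyzed"] - df["DateAnalyzed"].shift(1)
print(df)
```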
There were similar questions after this one here and here, with more specific output requirements. Since this one is more general, I would like to contribute here as well.
We can easily assign a unique identifier to each group of consecutive dates with one line of code:
df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()
Here, every time the difference from the previous date is not exactly one day, ne(1) is True and the cumulative sum increments; otherwise the identifier carries over from the previous row, so each run of consecutive dates ends up with its own unique identifier.
See the output:
DateAnalyzed Val grp_date
1 2018-03-18 0.470253 1
2 2018-03-19 0.470253 1
3 2018-03-20 0.470253 1
4 2018-09-25 0.467729 2
5 2018-09-26 0.467729 2
6 2018-09-27 0.467729 2
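To see how the one-liner arrives at those identifiers, here is a sketch that breaks it into its intermediate steps, assuming the same DataFrame as above (the DataFrame construction is an assumption based on the values shown):

```python
import pandas as pd

df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
})

# Step 1: day gap to the previous row (NaN for the first row).
diff_days = df["DateAnalyzed"].diff().dt.days

# Step 2: True wherever the gap is NOT exactly 1 day, i.e. a new run starts.
# Note that NaN != 1 is True, so the first row also starts a run.
new_run = diff_days.ne(1)

# Step 3: cumulative sum turns each True into a new group id.
df["grp_date"] = new_run.cumsum()
print(df)  # grp_date: 1, 1, 1, 2, 2, 2
```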
Now it's easy to groupby
"grp_date" and do whatever you want with apply
or agg
.
Examples:
# Sum across consecutive days (or any other method from pandas groupby)
df.groupby('grp_date').sum()
# Get the first value and last value per consecutive days
df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
# or df.groupby('grp_date').head(n) for first n days
# Perform a custom operation across target columns (col1/col2 are placeholders for your own columns)
df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())
# Multiple operations for a target-column
df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])
# and so on...
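And since the question asks to take each run of 3 rows and "do some processing", iterating over the groupby also works: each iteration yields one run of consecutive dates as its own DataFrame. A minimal sketch, assuming the same DataFrame as above and using the mean of Val as a stand-in for your processing step:

```python
import pandas as pd

df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
})
df["grp_date"] = df["DateAnalyzed"].diff().dt.days.ne(1).cumsum()

# Each chunk is a sub-DataFrame holding one run of consecutive dates,
# so any per-run processing can go in the loop body.
for key, chunk in df.groupby("grp_date"):
    print(f"group {key}: {len(chunk)} rows, mean Val = {chunk['Val'].mean():.6f}")
```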