Find group of consecutive dates in Pandas DataFrame

I am trying to get the chunks of data where there are consecutive dates in a Pandas DataFrame. My df looks like this:

      DateAnalyzed           Val
1       2018-03-18      0.470253
2       2018-03-19      0.470253
3       2018-03-20      0.470253
4       2018-09-25      0.467729
5       2018-09-26      0.467729
6       2018-09-27      0.467729

In this df, I want to get the first 3 rows, do some processing on them, and then do the same with the last 3 rows.

I calculated the difference with a lag of 1 by applying the following code.

df['Delta']=(df['DateAnalyzed'] - df['DateAnalyzed'].shift(1))

But from there I can't figure out how to get the groups of consecutive rows without iterating.
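For reference, a minimal reproduction of this setup (data taken from the question above):

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253, 0.470253, 0.470253,
            0.467729, 0.467729, 0.467729],
}, index=range(1, 7))

# Lag-1 difference: NaT for the first row, then Timedeltas
df["Delta"] = df["DateAnalyzed"] - df["DateAnalyzed"].shift(1)
print(df["Delta"].dt.days.tolist())
```

Rows inside a consecutive run have a Delta of exactly one day; the gap row jumps to 189 days, which is what the grouping below keys on.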

asked Oct 20 '18 by smm


1 Answer

There were similar questions after this one here and here, with more specific output requirements. Since this one is more general, I would like to contribute here as well.

We can easily assign a unique identifier to consecutive groups with one line of code:

df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()

Here, whenever the difference from the previous date is not exactly one day, `.ne(1)` yields True and the cumulative sum increments, starting a new group identifier; otherwise it stays at the previous value. (The first row's diff is NaN, which also compares unequal to 1, so the first group starts at 1.) We end up with a unique identifier per group of consecutive dates.

See the output:

  DateAnalyzed       Val  grp_date
1   2018-03-18  0.470253         1
2   2018-03-19  0.470253         1
3   2018-03-20  0.470253         1
4   2018-09-25  0.467729         2
5   2018-09-26  0.467729         2
6   2018-09-27  0.467729         2
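As a self-contained check (rebuilding the frame from the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
}, index=range(1, 7))

# A new group starts wherever the day-to-day gap is not exactly 1
# (the first row's diff is NaN, which also compares unequal to 1)
df["grp_date"] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()
print(df["grp_date"].tolist())  # [1, 1, 1, 2, 2, 2]
```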

Now it's easy to groupby "grp_date" and do whatever you want with apply or agg.


Examples:

# Sum across consecutive days (or any other method from pandas groupby)
df.groupby('grp_date').sum()

# Get the first value and last value per consecutive days
df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
# or df.groupby('grp_date').head(n) for first n days

# Perform a custom operation across target columns
# (col1 / col2 are placeholders for columns in your own data)
df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())

# Multiple operations for a target-column
df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])

# and so on...
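And if the goal is to process each consecutive run separately, as in the original question, one option (a sketch, assuming the `grp_date` column built above) is to split the frame into one sub-DataFrame per run:

```python
import pandas as pd

df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
})
df["grp_date"] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()

# One sub-DataFrame per run of consecutive dates;
# drop the helper column so each chunk matches the original layout
chunks = {grp: sub.drop(columns="grp_date")
          for grp, sub in df.groupby("grp_date")}
for grp, sub in chunks.items():
    print(grp, len(sub), sub["Val"].mean())
```

Each value in `chunks` can then be processed independently, with no explicit row-by-row iteration.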
answered Sep 17 '22 by Cainã Max Couto-Silva