I am trying to extract the chunks of data with consecutive dates from a Pandas DataFrame. My df
looks like this:
DateAnalyzed Val
1 2018-03-18 0.470253
2 2018-03-19 0.470253
3 2018-03-20 0.470253
4 2018-09-25 0.467729
5 2018-09-26 0.467729
6 2018-09-27 0.467729
In this df
, I want to take the first 3 rows, do some processing, then take the last 3 rows and do processing on those.
I calculated the difference with a lag of 1 by applying the following code.
df['Delta']=(df['DateAnalyzed'] - df['DateAnalyzed'].shift(1))
But after that I can't figure out how to get the groups of consecutive rows without iterating.
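To make the question reproducible, here is a minimal sketch that rebuilds the DataFrame from the table above and computes the same lag-1 difference; the DataFrame construction is an assumption based on the values shown.

```python
import pandas as pd

# Reproducing the question's DataFrame (values copied from the table above).
df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253, 0.470253, 0.470253,
            0.467729, 0.467729, 0.467729],
}, index=range(1, 7))

# Lag-1 difference: a Timedelta of 1 day marks a row consecutive with the
# previous one; NaT (first row) or a larger gap marks the start of a new run.
df["Delta"] = df["DateAnalyzed"] - df["DateAnalyzed"].shift(1)
print(df)
```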
There were similar questions after this one here and here, with more specific output requirements. Since this one is more general, I would like to contribute here as well.
We can easily assign a unique identifier to each group of consecutive dates with one line of code:
df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()
Here, every time the difference from the previous date is not exactly one day, ne(1) is True and the cumulative sum increments; otherwise the identifier carries over from the previous row, so each run of consecutive dates ends up with its own unique identifier.
See the output:
DateAnalyzed Val grp_date
1 2018-03-18 0.470253 1
2 2018-03-19 0.470253 1
3 2018-03-20 0.470253 1
4 2018-09-25 0.467729 2
5 2018-09-26 0.467729 2
6 2018-09-27 0.467729 2
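To see how the one-liner arrives at those identifiers, here is a sketch that breaks it into its intermediate steps, assuming the same DataFrame as above (the DataFrame construction is an assumption based on the values shown):

```python
import pandas as pd

df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
})

# Step 1: day gap to the previous row (NaN for the first row).
diff_days = df["DateAnalyzed"].diff().dt.days

# Step 2: True wherever the gap is NOT exactly 1 day, i.e. a new run starts.
# Note that NaN != 1 is True, so the first row also starts a run.
new_run = diff_days.ne(1)

# Step 3: cumulative sum turns each True into a new group id.
df["grp_date"] = new_run.cumsum()
print(df)  # grp_date: 1, 1, 1, 2, 2, 2
```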
Now it's easy to groupby
"grp_date" and do whatever you want with apply
or agg
.
Examples:
# Sum across consecutive days (or any other method from pandas groupby)
df.groupby('grp_date').sum()
# Get the first value and last value per consecutive days
df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
# or df.groupby('grp_date').head(n) for first n days
# Perform a custom operation across target columns (col1/col2 are placeholders for your own columns)
df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())
# Multiple operations for a target-column
df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])
# and so on...
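And since the question asks to take each run of 3 rows and "do some processing", iterating over the groupby also works: each iteration yields one run of consecutive dates as its own DataFrame. A minimal sketch, assuming the same DataFrame as above and using the mean of Val as a stand-in for your processing step:

```python
import pandas as pd

df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
})
df["grp_date"] = df["DateAnalyzed"].diff().dt.days.ne(1).cumsum()

# Each chunk is a sub-DataFrame holding one run of consecutive dates,
# so any per-run processing can go in the loop body.
for key, chunk in df.groupby("grp_date"):
    print(f"group {key}: {len(chunk)} rows, mean Val = {chunk['Val'].mean():.6f}")
```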