I have a data frame df
:
df =
index date hats
A1 01-01-2020 5
A1 02-01-2020 10
A1 03-01-2020 16
A1 04-01-2020 16
A1 21-01-2020 9
A1 22-01-2020 8
A1 23-01-2020 7
A6 20-03-2020 5
A6 21-03-2020 5
A8 30-07-2020 12
Here, the first four rows are consecutive days. I want to know the start date and end date of all such consecutive days in the data frame. If there is only one day in a series like wise A8
index in the df
then the start date and and end will be same. Moreover, I am also interested in knowing the highest value in df['hats']
column in the series of consecutive days and return its date in a seperate column high_hat
along with its date high_hat_date
. If there are two or more equal high values in a series of consecutive days then write the number of occurrence of high value in a new column num_hat
, and write the first occurrence date in high_hat_date
.
The example output for the above data frame is as follows:
index start_date end_date high_hat high_hat_date num_hat
A1 01-01-2020 04-01-2020 16 03-01-2020 2
A1 21-01-2020 23-01-2020 9 21-01-2020 1
A6 20-03-2020 21-03-2020 5 20-03-2020 2
A8 30-07-2020 30-07-2020 12 30-07-2020 1
Any help in this regard is highly appreciated.
First using pd.to_datetime
convert the date
column to pandas datetime
series:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
Then use:
g = df.groupby('index')['date'].diff().dt.days.ne(1).cumsum() # STEP A
m = df.groupby(['index', g])['hats'].transform('max').eq(df['hats']) # STEP B
df = df.assign(high_hats=df['hats'].mask(~m), high_date=df['date'].mask(~m)) # STEP C
dct = {'start_date': ('date', 'first'), 'end_date': ('date', 'last'), 'high_hat': ('hats', 'max'),
'high_hat_date': ('high_date', 'first'), 'num_hats': ('high_hats', 'count')}
df1 = df.groupby(['index', g]).agg(**dct).reset_index().drop('date', 1) # STEP D
Details:
STEP A: Use DataFrame.groupby
on index
and use groupby.diff
on date
to calculate the the days elapsed between successive dates then use Series.dt.days
+ Series.ne
along with Series.cumsum
to create a grouping series g
which will be needed to group the dataframe on consecutive dates.
# print(g)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 4
Name: date, dtype: int64
STEP B: Use DataFrame.groupby
on index
and g
and use groupby.transform
to transform the column hats
using max
then using Series.eq
equate it with hats
column to create a boolean mask m
.
# print(m)
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 True
Name: hats, dtype: bool
STEP C: Next use DataFrame.assign
to assign two new columns high_hats
and high_date
which will be used in STEP D
for calculation of high_hat_date
and num_hats
.
# print(df)
index date hats high_hats high_date
0 A1 2020-01-01 5 NaN NaT
1 A1 2020-01-02 10 NaN NaT
2 A1 2020-01-03 16 16.0 2020-01-03
3 A1 2020-01-04 16 16.0 2020-01-04
4 A1 2020-01-21 9 9.0 2020-01-21
5 A1 2020-01-22 8 NaN NaT
6 A1 2020-01-23 7 NaN NaT
7 A6 2020-03-20 5 5.0 2020-03-20
8 A6 2020-03-21 5 5.0 2020-03-21
9 A8 2020-07-30 12 12.0 2020-07-30
STEP D: Using DataFrame.groupby
on index
and g
and aggregate the dataframe using the aggregation dictionary dct
which contains all the columns and their corresponding agg
functions to be applied.
# print(df1)
index start_date end_date high_hat high_hat_date num_hats
0 A1 2020-01-01 2020-01-04 16 2020-01-03 2
1 A1 2020-01-21 2020-01-23 9 2020-01-21 1
2 A6 2020-03-20 2020-03-21 5 2020-03-20 2
3 A8 2020-07-30 2020-07-30 12 2020-07-30 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With