I have a dataframe df like the one below:
city datetime value
0 city_a 2020-07-10 2
1 city_a 2020-07-11 5
2 city_b 2020-07-11 4
And I am trying to resample the daily datetimes to a 6-hour frequency (one row at 00h, 06h, 12h and 18h).
The following code gives me almost the output I am expecting:
my_df = my_df.set_index(['datetime', 'city'])
my_df = my_df.unstack(-1).resample('6H').pad()
my_df = my_df.stack().reset_index()
my_df = my_df[['city', 'datetime', 'value']]
my_df = my_df.sort_values(['city', 'datetime'])
Output:
city datetime value
0 city_a 2020-07-10 00:00:00 2.0
1 city_a 2020-07-10 06:00:00 2.0
2 city_a 2020-07-10 12:00:00 2.0
3 city_a 2020-07-10 18:00:00 2.0
4 city_a 2020-07-11 00:00:00 5.0
5 city_b 2020-07-11 00:00:00 4.0
However, we can see that the day 2020-07-11 is not complete: I would like rows for 2020-07-11 06:00:00, 12:00:00 and 18:00:00 to appear in the output as well.
So my expected output should be:
city datetime value
0 city_a 2020-07-10 00:00:00 2.0
1 city_a 2020-07-10 06:00:00 2.0
2 city_a 2020-07-10 12:00:00 2.0
3 city_a 2020-07-10 18:00:00 2.0
4 city_a 2020-07-11 00:00:00 5.0
6 city_a 2020-07-11 06:00:00 5.0
8 city_a 2020-07-11 12:00:00 5.0
10 city_a 2020-07-11 18:00:00 5.0
5 city_b 2020-07-11 00:00:00 4.0
7 city_b 2020-07-11 06:00:00 4.0
9 city_b 2020-07-11 12:00:00 4.0
11 city_b 2020-07-11 18:00:00 4.0
Is there an elegant way to do this with pandas?
Code to generate the dataframe:
import pandas as pd

my_df = pd.DataFrame(data={
    'city': ['city_a', 'city_a', 'city_b'],
    'datetime': [pd.to_datetime('2020/07/10'),
                 pd.to_datetime('2020/07/11'),
                 pd.to_datetime('2020/07/11')],
    'value': [2, 5, 4]
})
Use:
# STEP A
df1 = (df.groupby('city')['datetime'].max() + pd.Timedelta(days=1)).reset_index()
# STEP B
df1 = pd.concat([df, df1]).set_index('datetime')
# STEP C
df1 = df1.groupby('city', as_index=False).resample('6H').ffill()
# STEP D
df1 = df1.reset_index().drop(columns='level_0').dropna(subset=['value'])
Details:
STEP A: Use DataFrame.groupby
to group the dataframe on city
, take the maximum datetime of each group and add 1 day to it. These extra timestamps act as sentinels so that the resample in STEP C extends through 18:00 of each group's last day.
# print(df1)
city datetime
0 city_a 2020-07-12
1 city_b 2020-07-12
STEP B: Use pd.concat
to append the sentinel rows in df1
to the original dataframe df
and set datetime as the index. The sentinel rows have NaN in value; they exist only so that the resample in STEP C covers the full last day.
# print(df1)
city value
datetime
2020-07-10 city_a 2.0
2020-07-11 city_a 5.0
2020-07-11 city_b 4.0
2020-07-12 city_a NaN
2020-07-12 city_b NaN
STEP C: Group on city and use DataFrame.resample
with a 6-hour frequency ('6H'; pandas 2.2+ prefers the lowercase alias '6h')
and use ffill
to forward-fill the values within each group.
# print(df1)
city value
datetime
0 2020-07-10 00:00:00 city_a 2.0
2020-07-10 06:00:00 city_a 2.0
2020-07-10 12:00:00 city_a 2.0
2020-07-10 18:00:00 city_a 2.0
2020-07-11 00:00:00 city_a 5.0
2020-07-11 06:00:00 city_a 5.0
2020-07-11 12:00:00 city_a 5.0
2020-07-11 18:00:00 city_a 5.0
2020-07-12 00:00:00 city_a NaN
1 2020-07-11 00:00:00 city_b 4.0
2020-07-11 06:00:00 city_b 4.0
2020-07-11 12:00:00 city_b 4.0
2020-07-11 18:00:00 city_b 4.0
2020-07-12 00:00:00 city_b NaN
STEP D: Finally use DataFrame.reset_index
to restore the columns, drop the helper column level_0 using DataFrame.drop
(note: passing the axis positionally, as in drop('level_0', 1), was removed in pandas 2.0; use drop(columns='level_0'))
, and use DataFrame.dropna
to remove the sentinel rows, which have NaN
in the value column.
# print(df1)
datetime city value
0 2020-07-10 00:00:00 city_a 2.0
1 2020-07-10 06:00:00 city_a 2.0
2 2020-07-10 12:00:00 city_a 2.0
3 2020-07-10 18:00:00 city_a 2.0
4 2020-07-11 00:00:00 city_a 5.0
5 2020-07-11 06:00:00 city_a 5.0
6 2020-07-11 12:00:00 city_a 5.0
7 2020-07-11 18:00:00 city_a 5.0
9 2020-07-11 00:00:00 city_b 4.0
10 2020-07-11 06:00:00 city_b 4.0
11 2020-07-11 12:00:00 city_b 4.0
12 2020-07-11 18:00:00 city_b 4.0
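Putting the four steps together, here is a condensed, self-contained sketch of the same idea. It differs slightly from the step-by-step code above: it selects only the value column before resampling, so city comes back cleanly from the group index and no level_0 column needs dropping, and it uses the lowercase '6h' alias:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['city_a', 'city_a', 'city_b'],
    'datetime': pd.to_datetime(['2020-07-10', '2020-07-11', '2020-07-11']),
    'value': [2, 5, 4],
})

# STEP A: per-city max date + 1 day -> sentinel timestamps.
sentinels = (df.groupby('city')['datetime'].max()
             + pd.Timedelta(days=1)).reset_index()

# STEP B: append the sentinel rows (their 'value' is NaN) and index by datetime.
tmp = pd.concat([df, sentinels]).set_index('datetime')

# STEP C: resample each city at a 6-hour frequency, forward-filling values.
res = tmp.groupby('city')[['value']].resample('6h').ffill()

# STEP D: restore the columns and drop the NaN sentinel rows.
out = res.reset_index().dropna(subset=['value'])
out = out[['city', 'datetime', 'value']].reset_index(drop=True)
```

Each city ends up with a full day of 6-hour slots on its last date, and the sentinel rows never reach the output because their value is NaN.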