I have a dataframe that contains hourly data:
area date hour output
H1 2018-07-01 07:00:00 150
H1 2018-07-01 08:00:00 150
H1 2018-07-01 09:00:00 100
H1 2018-07-01 11:00:00 150
H2 2018-07-01 09:00:00 100
H2 2018-07-01 10:00:00 50
H2 2018-07-01 11:00:00 50
H2 2018-07-01 12:00:00 150
However, the data only contains rows for the hours when there was output. How can I fill in the missing hours for each area with output 0? For example, add two rows for H1:
area date hour output
H1 2018-07-01 10:00:00 0
H1 2018-07-01 12:00:00 0
I can assume that the min and max hour across all areas are the beginning and end of the sample period (in this case 07:00:00 and 12:00:00).
Right now, I'm creating a dataframe containing all the hours from 7:00 to 12:00 for each area and then doing a merge of my data with that dataframe, and then filling the NaN with 0s. This is very slow as my data set can have millions of rows.
Is there any better way of doing this?
You can use resample with groupby:
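import pandas as pd
df['Datetime'] = pd.to_datetime(df['date'] + ' ' + df['hour'])  # combine date and hour into one datetime
df = df.drop(columns=['date', 'hour'])  # drop the now-redundant columns
# resample each area's output to hourly frequency ('h'; older pandas also accepts 'H');
# hours with no rows come back as NaN and are then filled with 0
df.set_index('Datetime').groupby('area')['output'].\
resample('h').mean().fillna(0).\
reset_index()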
Out[662]:
area Datetime output
0 H1 2018-07-01 07:00:00 150.0
1 H1 2018-07-01 08:00:00 150.0
2 H1 2018-07-01 09:00:00 100.0
3 H1 2018-07-01 10:00:00 0.0
4 H1 2018-07-01 11:00:00 150.0
5 H2 2018-07-01 09:00:00 100.0
6 H2 2018-07-01 10:00:00 50.0
7 H2 2018-07-01 11:00:00 50.0
8 H2 2018-07-01 12:00:00 150.0
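Note that the output above only fills gaps between each area's own first and last hour: H1 gets no 12:00 row and H2 gets no 07:00 or 08:00 rows. If every area should cover the whole sample period, as the question assumes, one possible approach is to reindex against the full area x hour grid. The following is a minimal sketch, not a benchmarked solution; the names all_hours, full_index and filled are illustrative, and it assumes each (area, Datetime) pair appears at most once.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'area': ['H1', 'H1', 'H1', 'H1', 'H2', 'H2', 'H2', 'H2'],
    'date': ['2018-07-01'] * 8,
    'hour': ['07:00:00', '08:00:00', '09:00:00', '11:00:00',
             '09:00:00', '10:00:00', '11:00:00', '12:00:00'],
    'output': [150, 150, 100, 150, 100, 50, 50, 150],
})
df['Datetime'] = pd.to_datetime(df['date'] + ' ' + df['hour'])

# Every hour between the overall first and last timestamp
all_hours = pd.date_range(df['Datetime'].min(), df['Datetime'].max(), freq='h')

# Every (area, hour) combination
full_index = pd.MultiIndex.from_product(
    [df['area'].unique(), all_hours], names=['area', 'Datetime'])

# Reindex onto the full grid; combinations with no row get output 0
filled = (df.set_index(['area', 'Datetime'])['output']
            .reindex(full_index, fill_value=0)
            .reset_index())
print(filled)
```

Because this avoids a Python-level apply per group, it may scale better on large inputs, but that is worth measuring on your own data.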
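You can create a date range from the min to the max timestamp, merge it with your existing dataframe, and fill the missing values with 0. Note that the merge below is on date only, so fillna(0) also sets area to 0 in the added row.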
df
area date hour output
0 H1 2018-07-01 07:00:00 07:00:00 150
1 H1 2018-07-01 08:00:00 08:00:00 150
2 H1 2018-07-01 09:00:00 09:00:00 100
6 H2 2018-07-01 11:00:00 11:00:00 50
7 H2 2018-07-01 12:00:00 12:00:00 150
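import pandas as pd
# assumes the date column already holds the combined timestamp, as in the frame printed above
df['date'] = pd.to_datetime(df['date'])
# one row per hour between the first and last timestamp
hours = pd.DataFrame({'date': pd.date_range(df['date'].min(), df['date'].max(), freq='h')})
# outer merge keeps every hour; hours with no data get 0 in all other columns
df = hours.merge(df, on='date', how='outer').fillna(0)
df['hour'] = df['date'].dt.strftime('%H:%M:%S')
df['date'] = df['date'].dt.strftime('%d-%m-%Y')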
df
Out:
date area hour output
0 01-07-2018 H1 07:00:00 150.0
1 01-07-2018 H1 08:00:00 150.0
2 01-07-2018 H1 09:00:00 100.0
3 01-07-2018 0 10:00:00 0.0
4 01-07-2018 H2 11:00:00 50.0
5 01-07-2018 H2 12:00:00 150.0
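Since the merge above joins on date alone, the added row ends up with area 0 rather than being filled in per area. A sketch of the same merge idea applied per area is below: build one row for every (area, hour) pair, left-merge the data onto it, and fill only the output column. This is essentially the approach described in the question, so it is shown for correctness rather than speed. The names hours, areas, grid and merged are illustrative, and how='cross' assumes pandas 1.2 or later.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'area': ['H1', 'H1', 'H1', 'H1', 'H2', 'H2', 'H2', 'H2'],
    'date': ['2018-07-01'] * 8,
    'hour': ['07:00:00', '08:00:00', '09:00:00', '11:00:00',
             '09:00:00', '10:00:00', '11:00:00', '12:00:00'],
    'output': [150, 150, 100, 150, 100, 50, 50, 150],
})
df['Datetime'] = pd.to_datetime(df['date'] + ' ' + df['hour'])

# All hours in the sample period and all distinct areas
hours = pd.DataFrame({'Datetime': pd.date_range(
    df['Datetime'].min(), df['Datetime'].max(), freq='h')})
areas = pd.DataFrame({'area': df['area'].unique()})

# Cross join: one row per (area, hour) combination
grid = areas.merge(hours, how='cross')

# Left merge keeps every grid row; unmatched rows have NaN output
merged = grid.merge(df[['area', 'Datetime', 'output']],
                    on=['area', 'Datetime'], how='left')

# Fill only the output column so area is never overwritten
merged['output'] = merged['output'].fillna(0).astype(int)
print(merged)
```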