CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
missing data point 1
6 17111 2018-01-01 07:00:00 1.835
7 17112 2018-01-01 00:00:00 1.095
8 17112 2018-01-01 01:00:00 1.129
missing data point 2
9 17112 2018-01-01 03:00:00 1.833
10 17112 2018-01-01 04:00:00 1.697
11 17112 2018-01-01 05:00:00 1.835
For every customer, I have hourly data, but some data points are missing in between. For each customer, I want to find the min and max UsageDate, fill in every missing hourly UsageDate within that interval, and set EnergyConsumed to zero for the filled rows. I can later use ffill or backfill to take care of these.
Not every customer's max UsageDate is 2018-01-31 23:00:00. So we only want to extend the series till the max date of every customer.
missing point 1 is replaced by
17111 2018-01-01 06:00:00 0
missing point 2 is replaced by
17112 2018-01-01 02:00:00 0
My main point of trouble is how to find the min and max date of every customer and then generate the gaps of dates.
I have tried indexing by date and resampling, but that hasn't helped me reach a solution.
Also, I was wondering if there is a way to directly find the customer IDs that have missing values in the pattern described above. My data is very large and the solution provided by @Vaishali is computationally heavy. Any inputs would be helpful!
You can group the DataFrame by CustID and build an index covering the desired date range for each customer, then use that index to reindex the data.
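import pandas as pd
df['UsageDate'] = pd.to_datetime(df['UsageDate'])
# Build a (CustID, hourly timestamp) MultiIndex over each customer's own min..max range (newer pandas prefers freq='h')
idx = df.groupby('CustID')['UsageDate'].apply(lambda x: pd.Series(index=pd.date_range(x.min(), x.max(), freq='H'), dtype='float64')).index
# Reindex onto the full range, fill the gaps with 0, and turn the index levels back into columns
df = df.set_index(['CustID', 'UsageDate']).reindex(idx).fillna(0).reset_index().rename(columns={'level_1': 'UsageDate'})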
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 0.000
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 0.000
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
Explanation: The UsageDate values for each CustID must cover every hour between that customer's minimum and maximum date, so we group the data by CustID and, for each group, build an empty series whose index is the date_range from the min to the max date. The dates go into the index of the series, not its values. The result of the groupby is therefore a MultiIndex with CustID as level 0 and the usage date as level 1. We then use this MultiIndex to reindex the original DataFrame: rows whose index matches keep their values, and the rest are assigned NaN. Finally, fillna converts the NaN values to 0. Because the date level of the new index has no name, reset_index calls it level_1, which is why it is renamed back to UsageDate.
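On the follow-up about directly finding the customers that have gaps: one cheaper check, offered here only as a sketch, is to compare each customer's row count with the number of hours between their first and last timestamp. It needs a single groupby aggregation and no reindexing. The names span and custs_with_gaps are made up for this example, and it assumes there are no duplicate (CustID, UsageDate) rows and that every timestamp falls exactly on the hour.

```python
import pandas as pd

# Sample data copied from the question (two customers, one missing hour each).
df = pd.DataFrame({
    "CustID": [17111] * 7 + [17112] * 5,
    "UsageDate": pd.to_datetime([
        "2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 02:00",
        "2018-01-01 03:00", "2018-01-01 04:00", "2018-01-01 05:00",
        "2018-01-01 07:00",
        "2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 03:00",
        "2018-01-01 04:00", "2018-01-01 05:00",
    ]),
    "EnergyConsumed": [1.095, 1.129, 1.165, 1.833, 1.697, 1.835, 1.835,
                       1.095, 1.129, 1.833, 1.697, 1.835],
})

# Per-customer first timestamp, last timestamp, and number of rows.
span = df.groupby("CustID")["UsageDate"].agg(["min", "max", "count"])

# Number of hourly rows a complete series would contain (both endpoints included).
span["expected"] = (span["max"] - span["min"]) // pd.Timedelta(hours=1) + 1

# Assumption: no duplicate rows and all timestamps exactly on the hour,
# so any shortfall in the row count means missing hours.
span["n_missing"] = span["expected"] - span["count"]
custs_with_gaps = span.index[span["n_missing"] > 0].tolist()

print(span)
print(custs_with_gaps)
```

If only a small share of customers turn up in custs_with_gaps, the reindex step above could be limited to those rows, for example with df[df["CustID"].isin(custs_with_gaps)], and the untouched customers concatenated back afterwards.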