
Find missing values of datetime for every customer

    CustID  UsageDate               EnergyConsumed
0   17111   2018-01-01 00:00:00     1.095
1   17111   2018-01-01 01:00:00     1.129
2   17111   2018-01-01 02:00:00     1.165
3   17111   2018-01-01 03:00:00     1.833
4   17111   2018-01-01 04:00:00     1.697
5   17111   2018-01-01 05:00:00     1.835
missing data point 1
6   17111   2018-01-01 07:00:00     1.835
7   17112   2018-01-01 00:00:00     1.095
8   17112   2018-01-01 01:00:00     1.129
missing data point 2
9   17112   2018-01-01 03:00:00     1.833
10  17112   2018-01-01 04:00:00     1.697
11  17112   2018-01-01 05:00:00     1.835

For every customer, I have hourly data. However, some data points are missing in between. I want to find the min and max UsageDate for each customer and fill in the missing hourly UsageDates in that interval, with EnergyConsumed as zero. I can later use ffill or bfill to take care of this.

Not every customer's max UsageDate is 2018-01-31 23:00:00, so the series should only be extended up to each customer's own max date.

missing point 1 is replaced by

17111        2018-01-01 06:00:00     0

missing point 2 is replaced by

17112        2018-01-01 02:00:00     0

My main trouble is how to find the min and max date for every customer and then generate the missing dates in between.
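For the min/max part on its own, a per-customer span can be computed with a groupby aggregation (a sketch using the sample data from the question; nothing beyond the question's column names is assumed):

```python
import pandas as pd

# Sample data from the question (hour 06 missing for 17111, hour 02 for 17112)
df = pd.DataFrame({
    'CustID': [17111] * 7 + [17112] * 5,
    'UsageDate': pd.to_datetime([
        '2018-01-01 00:00', '2018-01-01 01:00', '2018-01-01 02:00',
        '2018-01-01 03:00', '2018-01-01 04:00', '2018-01-01 05:00',
        '2018-01-01 07:00',
        '2018-01-01 00:00', '2018-01-01 01:00', '2018-01-01 03:00',
        '2018-01-01 04:00', '2018-01-01 05:00']),
    'EnergyConsumed': [1.095, 1.129, 1.165, 1.833, 1.697, 1.835, 1.835,
                       1.095, 1.129, 1.833, 1.697, 1.835],
})

# One row per customer with that customer's first and last UsageDate
span = df.groupby('CustID')['UsageDate'].agg(['min', 'max'])
```

`span` then holds each customer's own date range, which is the input needed for generating the hourly grid per customer.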

I have tried indexing by date and resampling, but that hasn't led me to a solution.

Also, I was wondering if there is a way to directly find CustIDs which have missing values in the pattern described above. My data is very large and the solution provided by @Vaishali is computationally heavy. Any inputs would be helpful!
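One cheap way to flag such customers without building the full hourly grid is to compare the actual row count per customer against the count an unbroken hourly series would have (a sketch, assuming the data is strictly hourly as in the question; the sample DataFrame below is reconstructed from the question):

```python
import pandas as pd

# Sample data from the question (hour 06 missing for 17111, hour 02 for 17112)
df = pd.DataFrame({
    'CustID': [17111] * 7 + [17112] * 5,
    'UsageDate': pd.to_datetime([
        '2018-01-01 00:00', '2018-01-01 01:00', '2018-01-01 02:00',
        '2018-01-01 03:00', '2018-01-01 04:00', '2018-01-01 05:00',
        '2018-01-01 07:00',
        '2018-01-01 00:00', '2018-01-01 01:00', '2018-01-01 03:00',
        '2018-01-01 04:00', '2018-01-01 05:00']),
    'EnergyConsumed': [1.095, 1.129, 1.165, 1.833, 1.697, 1.835, 1.835,
                       1.095, 1.129, 1.833, 1.697, 1.835],
})

g = df.groupby('CustID')['UsageDate']

# A gapless hourly series from min to max has (span in hours) + 1 rows
expected = (g.max() - g.min()).dt.total_seconds() // 3600 + 1
actual = g.size()

# Customers whose actual row count falls short have at least one missing hour
gapped = expected.index[expected.ne(actual)]
```

This touches each row once and never materializes the missing timestamps, so it stays cheap even on large data.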

Asked Oct 16 '22 by rAmAnA

1 Answer

You can group the DataFrame by CustID and create an index with the desired hourly date range, then use this index to reindex the data:

import pandas as pd

df['UsageDate'] = pd.to_datetime(df['UsageDate'])

# Build a MultiIndex of (CustID, hourly UsageDate) spanning each customer's own min-max range
idx = df.groupby('CustID')['UsageDate'].apply(
    lambda x: pd.Series(index=pd.date_range(x.min(), x.max(), freq='H'), dtype=float)).index

df = (df.set_index(['CustID', 'UsageDate'])
        .reindex(idx)
        .fillna(0)
        .reset_index()
        .rename(columns={'level_1': 'UsageDate'}))

    CustID  UsageDate               EnergyConsumed
0   17111   2018-01-01 00:00:00     1.095
1   17111   2018-01-01 01:00:00     1.129
2   17111   2018-01-01 02:00:00     1.165
3   17111   2018-01-01 03:00:00     1.833
4   17111   2018-01-01 04:00:00     1.697
5   17111   2018-01-01 05:00:00     1.835
6   17111   2018-01-01 06:00:00     0.000
7   17111   2018-01-01 07:00:00     1.835
8   17112   2018-01-01 00:00:00     1.095
9   17112   2018-01-01 01:00:00     1.129
10  17112   2018-01-01 02:00:00     0.000
11  17112   2018-01-01 03:00:00     1.833
12  17112   2018-01-01 04:00:00     1.697
13  17112   2018-01-01 05:00:00     1.835

Explanation: since the UsageDates must cover every hour between the minimum and maximum date for each CustID, we group the data by CustID and build a series of hourly dates with date_range, setting the dates as the index of the series rather than its values. The result of the groupby is a MultiIndex with CustID as level 0 and UsageDate as level 1. We then use this MultiIndex to reindex the original DataFrame: values are kept where the index matches and NaN is assigned elsewhere. Finally, fillna converts the NaN to 0.
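Regarding the performance concern raised in the question: if building and reindexing the full MultiIndex proves heavy, a groupby/resample variant does the same per-customer gap filling without constructing the index by hand. This is a sketch, not a benchmark, and it assumes the data is sorted by time within each customer (the sample DataFrame is reconstructed from the question):

```python
import pandas as pd

# Sample data from the question (hour 06 missing for 17111, hour 02 for 17112)
df = pd.DataFrame({
    'CustID': [17111] * 7 + [17112] * 5,
    'UsageDate': pd.to_datetime([
        '2018-01-01 00:00', '2018-01-01 01:00', '2018-01-01 02:00',
        '2018-01-01 03:00', '2018-01-01 04:00', '2018-01-01 05:00',
        '2018-01-01 07:00',
        '2018-01-01 00:00', '2018-01-01 01:00', '2018-01-01 03:00',
        '2018-01-01 04:00', '2018-01-01 05:00']),
    'EnergyConsumed': [1.095, 1.129, 1.165, 1.833, 1.697, 1.835, 1.835,
                       1.095, 1.129, 1.833, 1.697, 1.835],
})

out = (df.set_index('UsageDate')
         .groupby('CustID')['EnergyConsumed']
         .resample('H')    # per-customer hourly grid from that customer's min to max
         .asfreq()         # NaN rows appear at the missing hours
         .fillna(0)
         .reset_index())
```

Because resample operates within each group, every customer's series is only extended to its own max date, matching the requirement in the question.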

Answered Oct 21 '22 by Vaishali