
Filling NaN using the mean values from the same hour of previous days

I want to fill NaN using a mean of values from the same hour-minutes of the previous days. To simplify, this is an example of my df.

timstamp         data
22/04/2016 09:00 1
22/04/2016 09:05 2
...
23/04/2016 09:00 3
23/04/2016 09:05 4
...
24/04/2016 09:00 5
24/04/2016 09:05 6
...
25/04/2016 09:00 7
25/04/2016 09:05 8
...
25/04/2016 10:00 NaN
25/04/2016 10:05 NaN

The real data contains many days with continuous 5-minute intervals.

I tried df = df.groupby(df.index.minute).fillna(df.data.rolling(3).mean()) to take a rolling mean of the same hour-minute over the past days, but it did not work.

An alternative, df = df.groupby(df.index.minute).ffill(), takes the values from the previous two lines (i.e. 7 and 8), which come from the same minutes of the previous hour on the same day.
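The ffill behaviour described above can be reproduced on the last four rows of the example. This is only a small illustrative sketch: the frame below is rebuilt by hand from the sample table and is not the real data.

```python
import numpy as np
import pandas as pd

# Last four rows of the sample table, rebuilt by hand (illustrative only)
idx = pd.to_datetime([
    "2016-04-25 09:00", "2016-04-25 09:05",
    "2016-04-25 10:00", "2016-04-25 10:05",
])
df = pd.DataFrame({"data": [7.0, 8.0, np.nan, np.nan]}, index=idx)

# Grouping by minute only puts 09:00 and 10:00 in the same group,
# so the forward fill copies the value from the previous hour of the same day
filled = df.groupby(df.index.minute).ffill()
print(filled)
```

Because the grouping key ignores both the date and the hour, the fill has no notion of "previous days" at all, which is why 7 and 8 are carried forward.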

However, I want the following result:

timstamp         data
22/04/2016 09:00 1
22/04/2016 09:05 2
...
23/04/2016 09:00 3
23/04/2016 09:05 4
...
24/04/2016 09:00 5
24/04/2016 09:05 6
...
25/04/2016 09:00 7
25/04/2016 09:05 8
25/04/2016 10:00 3
25/04/2016 10:05 4

where the value 3 (second-to-last line) is the mean of the values from the same hour-minute of the previous days (mean of 1, 3, and 5), and 4 (last line) is the mean of 2, 4, and 6. Given the size of my df, I want to take the mean over tens of previous days.

Edit
I am getting closer. Using the following code, the mean of the data is calculated by similar hours and minutes as I wanted:

df.set_index('timstamp', inplace=True)
df=df.groupby([df.index.hour, df.index.minute]).mean()
df.index.names = ["hour", "minute"]

However, it uses the entire dataset to compute the hour-minute mean. What I want is to use the same hour-minute from the previous days only, with the number of past days configurable. The resulting mean should then be used to fill the NaN.

k.ko3n asked Nov 06 '22 18:11


1 Answer

Let's try this:

import numpy as np
import pandas as pd

# time sample every 5 mins
idx = pd.date_range('2018-01-01', '2018-01-31', freq='300s')
np.random.seed(2019)

# create toy data with roughly 5% of the values missing
df = pd.DataFrame({'idx': idx,
                   'data': np.random.uniform(0, 5, len(idx))})
df.loc[np.random.uniform(0, 1, len(idx)) > 0.95, 'data'] = np.nan

# means by the hour, can also use median
means = df.resample('h', on='idx').data.mean()

# get the timestamp on the hour
df['hour'] = df['idx'] - pd.to_timedelta(df.idx.dt.minute, unit='m')

# get the hour stamp of previous day
df['hour'] -= pd.to_timedelta(1, unit='d')

# update NaN with the hourly mean of the same hour on the previous day
# (indexing means directly with the hour stamps raised a ValueError due to
# duplicates, so each stamp is looked up with map() instead; rows on the
# first day have no previous day and stay NaN)
nan_rows = df.data.isna()
df.loc[nan_rows, 'data'] = df.loc[nan_rows, 'hour'].map(means)
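
The code above fills from the hourly mean of the previous day. If you need the same hour-minute over several previous days, as described in the question, one possible sketch is to shift the series by 1..n days and average the shifted copies. The helper name fill_from_previous_days and the n_days parameter are made up for illustration, and the sketch assumes a Series with a unique DatetimeIndex.

```python
import numpy as np
import pandas as pd

def fill_from_previous_days(s, n_days=3):
    """Fill NaN in s with the mean of the values observed at the same
    time of day on each of the previous n_days days (hypothetical helper)."""
    # shift(periods=k, freq="D") moves the index forward by k days, so after
    # reindexing, row t holds the value that was observed at t - k days
    lags = [s.shift(periods=k, freq="D").reindex(s.index)
            for k in range(1, n_days + 1)]
    # the row-wise mean skips days that are missing or NaN
    prev_mean = pd.concat(lags, axis=1, keys=range(1, n_days + 1)).mean(axis=1)
    return s.fillna(prev_mean)

# toy data shaped like the question's table, with the last day missing
idx = pd.to_datetime([
    "2016-04-22 09:00", "2016-04-22 09:05",
    "2016-04-23 09:00", "2016-04-23 09:05",
    "2016-04-24 09:00", "2016-04-24 09:05",
    "2016-04-25 09:00", "2016-04-25 09:05",
])
s = pd.Series([1, 2, 3, 4, 5, 6, np.nan, np.nan], index=idx, name="data")

filled = fill_from_previous_days(s, n_days=3)
print(filled)
```

Only originally observed values enter the mean, so filled values are never reused to fill later gaps. A NaN stays NaN when none of the previous n_days days has a value at that time.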

Quang Hoang answered Nov 14 '22 22:11