I need to resample timeseries data and interpolate missing values in 15 min intervals over the course of an hour. Each ID should have four rows of data per hour.
In:
ID Time Value
1 1/1/2019 12:17 3
1 1/1/2019 12:44 2
2 1/1/2019 12:02 5
2 1/1/2019 12:28 7
Out:
ID Time Value
1 2019-01-01 12:00:00 3.0
1 2019-01-01 12:15:00 3.0
1 2019-01-01 12:30:00 2.0
1 2019-01-01 12:45:00 2.0
2 2019-01-01 12:00:00 5.0
2 2019-01-01 12:15:00 7.0
2 2019-01-01 12:30:00 7.0
2 2019-01-01 12:45:00 7.0
I wrote a function to do this, however efficiency goes down drastically when trying to process a larger dataset.
Is there a more efficient way to do this?
import datetime
import pandas as pd
data = pd.DataFrame({'ID': [1,1,2,2],
'Time': ['1/1/2019 12:17','1/1/2019 12:44','1/1/2019 12:02','1/1/2019 12:28'],
'Value': [3,2,5,7]})
def clean_dataset(data):
ids = data.drop_duplicates(subset='ID')
data['Time'] = pd.to_datetime(data['Time'])
data['Time'] = data['Time'].apply(
lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour,15*(dt.minute // 15)))
data = data.drop_duplicates(subset=['Time','ID']).reset_index(drop=True)
df = pd.DataFrame(columns=['Time','ID','Value'])
for i in range(ids.shape[0]):
times = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
id_data = data[data['ID']==ids.iloc[i]['ID']]
clean_data = times.join(id_data.set_index('Time'), on='Time')
clean_data = clean_data.interpolate(method='linear', limit_direction='both')
clean_data.drop(clean_data.tail(1).index,inplace=True)
df = df.append(clean_data)
return df
clean_dataset(data)
Linear interpolation does become slow with a large data set. Having a loop in your code is also responsible for a large part of the slowdown. Anything that can be removed from the loop and pre-computed will help increase efficiency. For example, if you pre-define the data frame that you use to initialize times
, the code becomes 14% more efficient:
times_template = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
for i in range(ids.shape[0]):
times = times_template.copy()
Profiling your code confirms that the interpolation takes the longest amount of time (22.7%), followed by the join (13.1%), the append (7.71%), and then the drop (7.67%) commands.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With