Resample pandas dataframe and interpolate missing values for timeseries data

I need to resample time series data and interpolate missing values at 15-minute intervals over the course of an hour, so that each ID has four rows of data per hour.

In:

ID            Time  Value
1   1/1/2019 12:17      3
1   1/1/2019 12:44      2
2   1/1/2019 12:02      5
2   1/1/2019 12:28      7

Out:

ID                Time  Value
1  2019-01-01 12:00:00    3.0
1  2019-01-01 12:15:00    3.0
1  2019-01-01 12:30:00    2.0
1  2019-01-01 12:45:00    2.0
2  2019-01-01 12:00:00    5.0
2  2019-01-01 12:15:00    7.0
2  2019-01-01 12:30:00    7.0
2  2019-01-01 12:45:00    7.0

I wrote a function to do this; however, efficiency drops drastically when processing a larger dataset.

Is there a more efficient way to do this?

import datetime
import pandas as pd


data = pd.DataFrame({'ID': [1,1,2,2], 
                    'Time': ['1/1/2019 12:17','1/1/2019 12:44','1/1/2019 12:02','1/1/2019 12:28'], 
                    'Value': [3,2,5,7]})


def clean_dataset(data):
    ids = data.drop_duplicates(subset='ID')
    data['Time'] = pd.to_datetime(data['Time'])
    # Floor each timestamp down to its 15-minute bucket.
    data['Time'] = data['Time'].apply(
        lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour, 15 * (dt.minute // 15)))
    data = data.drop_duplicates(subset=['Time', 'ID']).reset_index(drop=True)
    df = pd.DataFrame(columns=['Time', 'ID', 'Value'])
    for i in range(ids.shape[0]):
        # Build a fresh 15-minute grid for the hour, join this ID's rows onto it,
        # and fill the gaps by linear interpolation in both directions.
        times = pd.DataFrame(pd.date_range('1/1/2019 12:00', '1/1/2019 13:00', freq='15min'),
                             columns=['Time'])
        id_data = data[data['ID'] == ids.iloc[i]['ID']]
        clean_data = times.join(id_data.set_index('Time'), on='Time')
        clean_data = clean_data.interpolate(method='linear', limit_direction='both')
        # Drop the trailing 13:00 row so each ID keeps exactly four rows.
        clean_data.drop(clean_data.tail(1).index, inplace=True)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
        df = pd.concat([df, clean_data])
    return df


clean_dataset(data)

1 Answer

Linear interpolation does become slow on a large dataset, but the Python-level loop in your code is responsible for a large part of the slowdown. Anything that can be hoisted out of the loop and pre-computed will improve efficiency. For example, pre-defining the data frame used to initialize times makes the code about 14% more efficient:

# Build the 15-minute grid once, outside the loop; copy it for each ID.
times_template = pd.DataFrame(pd.date_range('1/1/2019 12:00', '1/1/2019 13:00', freq='15min'),
                              columns=['Time'])
for i in range(ids.shape[0]):
    times = times_template.copy()
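
Taking the same idea further, the loop can be removed entirely. Below is a minimal sketch of one loop-free alternative (not part of the original answer) that floors the timestamps once and interpolates per ID with a single groupby; the grid bounds and the 15-minute frequency are assumed from the question's example.

import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 2, 2],
                     'Time': ['1/1/2019 12:17', '1/1/2019 12:44',
                              '1/1/2019 12:02', '1/1/2019 12:28'],
                     'Value': [3, 2, 5, 7]})

# Floor each timestamp to its 15-minute bucket, as the original code does,
# and keep one row per (ID, bucket).
data['Time'] = pd.to_datetime(data['Time']).dt.floor('15min')
data = data.drop_duplicates(subset=['ID', 'Time'])

# One shared 15-minute grid for the hour; every ID is reindexed against it.
grid = pd.date_range('2019-01-01 12:00', '2019-01-01 12:45',
                     freq='15min', name='Time')

out = (data.set_index('Time')
           .groupby('ID')['Value']
           .apply(lambda s: s.reindex(grid).interpolate(limit_direction='both'))
           .reset_index())

On the example input this reproduces the expected output, and because the per-ID work is a single reindex and interpolation, it avoids the repeated join and append calls inside the loop.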

Profiling your code confirms that the interpolate call takes the most time (22.7%), followed by the join (13.1%), the append (7.71%), and the drop (7.67%).
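
For reference, a breakdown like this can be produced with the standard-library profiler; a minimal sketch (the exact figures will vary with the pandas version, the data size, and the machine):

import cProfile
import pstats

# Profile the original function and print the ten most expensive calls.
cProfile.run('clean_dataset(data)', 'clean_stats')
pstats.Stats('clean_stats').sort_stats('cumulative').print_stats(10)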
