Averaging several time-series together with confidence interval (with test code)

It sounds complicated, but a simple plot makes it easy to understand: [image: plot of the three blue curves with the purple rolling mean and its band] I have three curves, each the cumulative sum of some values over time; these are the blue lines.

I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add confidence interval.

I tried one simple solution: combining all the data into one curve, averaging it with the pandas "rolling" function, and computing the rolling standard deviation. I plotted the result as the purple curve with the confidence interval around it.

The problem with my real data, as illustrated in the plot above, is that the curve isn't smooth at all. There are also sharp jumps in the confidence interval, which is not a good representation of the three separate curves, since they have no jumps.

Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval?

I have included test code below, tested on Python 3.5.1 with numpy and pandas (don't change the seed, so that you get the same curves).

There is one constraint: increasing the number of points for the "rolling" function isn't a solution for me, because some of my data is too short for that.

Test code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
np.random.seed(seed=42)


## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted =  pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])

df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted =  pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])

df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted =  pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])


## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
    df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time =  pd.concat([df1_combined_sorted['time'],
    df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)


## creating confidence intervals 
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()


## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
        ma['vals'] + 2 * mstd['vals'],color='b', alpha=0.2)
plt.plot(df_all_sorted['time'],ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
plt.show()
asked Sep 14 '25 08:09 by artembus



1 Answer

First of all, your sample code could be rewritten to make better use of pandas. For example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    # times has 50 rows and vals has 100, so the last 50 rows get NaN times
    # and are sorted to the end
    df = pd.concat([times, vals], axis=1).sort_values(by=['time']).reset_index(drop=True)
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1, df2, df3 = map(get_data, [10000, 13000, 4000])
dfs = (df1, df2, df3)

# join 
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals (mean +/- 2 std)
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val

    plt.figure(figsize=(16,9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')

    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()

Your curves probably aren't smooth because the rolling window is too small. You can increase the window size to get smoother graphs. For example, render(20) gives: [image: plot for render(20)]

while render(30) gives: [image: plot for render(30)]
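
If some of the series are too short for a wider window, one option is to keep the window size and relax how rolling treats the edges. The sketch below is not part of the original code: it uses synthetic data, and the names `noisy` and `roughness` are made up for illustration. It only shows what the `min_periods` and `center` arguments of `rolling` do.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a pooled, time-sorted series (illustrative only).
rng = np.random.RandomState(0)
noisy = pd.Series(np.linspace(0, 100, 60) + rng.normal(0, 10, size=60))

def roughness(s):
    """Mean absolute step between consecutive points, ignoring NaN."""
    return s.diff().abs().mean()

# Default behaviour: the first window-1 values are NaN.
strict = noisy.rolling(10).mean()

# min_periods=1 lets partial windows produce a value; center=True
# places the window around each point instead of behind it.
relaxed = noisy.rolling(10, min_periods=1, center=True).mean()

print("NaN count:", strict.isna().sum(), "vs", relaxed.isna().sum())
print("roughness:", roughness(noisy), "vs", roughness(relaxed))
```

Partial windows at the edges average fewer points, so the curve is noisier there than in the middle; whether that trade-off is acceptable depends on the data.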

However, a better approach might be to interpolate each curve's cumulative values over the entire time range, so that all three curves have a value at every time point, and then compute the mean and confidence interval across these aligned series. With that in mind, we can modify the code as follows:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df =  pd.concat([times, vals], axis = 1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df

df1, df2, df3 = map(get_data, [10000, 13000, 4000])
dfs = (df1, df2, df3)

# give each cumulative column a distinct name for later plotting
for i, df in enumerate(dfs):
    df.rename(columns={'cumulative': f'cumulative_{i}'}, inplace=True)

# concatenate the dataframes with common time index
df_all = pd.concat(dfs,sort=False).sort_index()

# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)

# compute the mean and the +/- 2 std band across the three curves, then plot
mean_val = df_all.iloc[:,1:].mean(axis=1)
std_val = df_all.iloc[:,1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val

fig, ax = plt.subplots(1,1,figsize=(16,9))
df_all.iloc[:,1:4].plot(ax=ax)

plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()

and we get: [image: plot of the three interpolated curves with their mean and band]
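
One detail may be worth checking in the code above. With its default `method='linear'`, `interpolate` treats the rows as equally spaced and ignores the index, while `method='index'` interpolates with respect to the actual index values. Since the time stamps here are irregular, the second may be closer to what is wanted. A minimal sketch with two made-up curves `a` and `b` (hypothetical, not the data above):

```python
import numpy as np
import pandas as pd

# Two toy cumulative curves sampled at different, irregular times.
a = pd.Series([0.0, 10.0], index=[0.0, 10.0], name='a')
b = pd.Series([0.0, 4.0, 8.0], index=[0.0, 1.0, 9.0], name='b')

# Align both curves on the union of their time stamps
# (NaN where a curve has no sample at that time).
aligned = pd.concat([a, b], axis=1).sort_index()

# Default: rows are treated as equally spaced, the time index is ignored.
by_position = aligned.interpolate(method='linear')
# Alternative: interpolate with respect to the numeric time index.
by_time = aligned.interpolate(method='index')

# Per-time-point mean and spread across the aligned curves.
mean_val = by_time.mean(axis=1)
std_val = by_time.std(axis=1)

print(by_position)
print(by_time)
```

For curve `a`, which runs from 0 at time 0 to 10 at time 10, the value at time 1 comes out as 1 with `method='index'`, but as one third of the way up (about 3.33) with the default, because time 1 is simply the second of four rows.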

answered Sep 15 '25 23:09 by Quang Hoang
