Sounds very complicated but a simple plot will make it easy to understand:
I have three curves of cumulative sum of some values over time, which are the blue lines.
I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add confidence interval.
I tried one simple solution - combining all the data into one curve, average it with the "rolling" function in pandas, getting the standard deviation for it. I plotted those as the purple curve with the confidence interval around it.
The problem with my real data, and as illustrated in the plot above is the curve isn't smooth at all, also there are sharp jumps in the confidence interval which also isn't a good representation of the 3 separate curves as there is no jumps in them.
Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval?
I supply a test code, tested on python 3.5.1 with numpy and pandas (don't change the seed in order to get the same curves).
There are some constrains - increasing the number of points for the "rolling" function isn't a solution for me because some of my data is too short for that.
Test code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
np.random.seed(seed=42)
## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])
df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])
df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])
## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,.
df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'],
df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)
## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()
## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
ma['vals'] + 2 * mstd['vals'],color='b', alpha=0.2)
plt.plot(df_all_sorted['time'],ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
matplotlib.use('Agg')
plt.show()
First of all, your sample code could be re-written to make better use of pd
. For example
np.random.seed(seed=42)
## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
df = pd.concat([times, vals], axis = 1).sort_values(by=['time']).\
reset_index().drop('index', axis=1)
df['cumulative'] = df.vals.cumsum()
return df
# generate the dataframes
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)
# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])
# render function
def render(window=10):
# compute rolling means and confident intervals
mean_val = df_all.cumulative.rolling(window).mean()
std_val = df_all.cumulative.rolling(window).std()
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val
plt.figure(figsize=(16,9))
for df in dfs:
plt.plot(df.time, df.cumulative, c='blue')
plt.plot(df_all.time, mean_val, c='r')
plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
plt.show()
The reason your curves aren't that smooth is maybe your rolling window is not large enough. You can increase this window size to get smoother graphs. For example render(20)
gives:
while render(30)
gives:
Although, the better way might be imputing each of df['cumulative']
to the entire time window and compute the mean/confidence interval on these series. With that in mind, we can modify the code as follows:
np.random.seed(seed=42)
## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
# note that we set time as index of the returned data
df = pd.concat([times, vals], axis = 1).dropna().set_index('time').sort_index()
df['cumulative'] = df.vals.cumsum()
return df
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)
# rename column for later plotting
for i,df in zip(range(3),dfs):
df.rename(columns={'cumulative':f'cummulative_{i}'}, inplace=True)
# concatenate the dataframes with common time index
df_all = pd.concat(dfs,sort=False).sort_index()
# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)
# plot graphs
mean_val = df_all.iloc[:,1:].mean(axis=1)
std_val = df_all.iloc[:,1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val
fig, ax = plt.subplots(1,1,figsize=(16,9))
df_all.iloc[:,1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()
and we get:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With