I am trying to determine what percentage of the time two time series overlap, using Python's pandas library. The data is nonsynchronous, so the time stamps of the two series do not line up. Here is an example:
Time Series 1
2016-10-05 11:50:02.000734 0.50
2016-10-05 11:50:03.000033 0.25
2016-10-05 11:50:10.000479 0.50
2016-10-05 11:50:15.000234 0.25
2016-10-05 11:50:37.000199 0.50
2016-10-05 11:50:49.000401 0.50
2016-10-05 11:50:51.000362 0.25
2016-10-05 11:50:53.000424 0.75
2016-10-05 11:50:53.000982 0.25
2016-10-05 11:50:58.000606 0.75
Time Series 2
2016-10-05 11:50:07.000537 0.50
2016-10-05 11:50:11.000994 0.50
2016-10-05 11:50:19.000181 0.50
2016-10-05 11:50:35.000578 0.50
2016-10-05 11:50:46.000761 0.50
2016-10-05 11:50:49.000295 0.75
2016-10-05 11:50:51.000835 0.75
2016-10-05 11:50:55.000792 0.25
2016-10-05 11:50:55.000904 0.75
2016-10-05 11:50:57.000444 0.75
Assuming each series holds its value until the next change, what is the most efficient way to determine the percentage of time that they have the same value?
Example
Let's calculate the time that these series overlap, starting at 11:50:07.000537 and ending at 11:50:57.000444, since we have data for both series over that period. The intervals during which the values match give:

(4.999755 + 12.000096 + 0.000558 + 0.000112) / 49.999907 ≈ 34%
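A quick arithmetic check of that example in plain Python, just restating the numbers above:

```python
# The four matching intervals (in seconds) and the total shared window
matched = [4.999755, 12.000096, 0.000558, 0.000112]
total = 49.999907

pct = sum(matched) / total
print(round(pct * 100, 2))  # 34.0
```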
One of the issues is that my actual time series have much more data, roughly 1,000 to 10,000 observations each, and I need to run many more pairs. I thought about forward filling a series and then simply comparing the rows and dividing the total number of matches by the total number of rows, but I do not think this would be very efficient.
setup
create 2 time series
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd
txt1 = """2016-10-05 11:50:02.000734  0.50
2016-10-05 11:50:03.000033  0.25
2016-10-05 11:50:10.000479  0.50
2016-10-05 11:50:15.000234  0.25
2016-10-05 11:50:37.000199  0.50
2016-10-05 11:50:49.000401  0.50
2016-10-05 11:50:51.000362  0.25
2016-10-05 11:50:53.000424  0.75
2016-10-05 11:50:53.000982  0.25
2016-10-05 11:50:58.000606  0.75"""
# note: read_csv's squeeze=True was removed in pandas 2.0,
# so squeeze to a Series after reading instead
s1 = pd.read_csv(StringIO(txt1), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None) \
       .squeeze('columns').rename('s1').rename_axis(None)
txt2 = """2016-10-05 11:50:07.000537  0.50
2016-10-05 11:50:11.000994  0.50
2016-10-05 11:50:19.000181  0.50
2016-10-05 11:50:35.000578  0.50
2016-10-05 11:50:46.000761  0.50
2016-10-05 11:50:49.000295  0.75
2016-10-05 11:50:51.000835  0.75
2016-10-05 11:50:55.000792  0.25
2016-10-05 11:50:55.000904  0.75
2016-10-05 11:50:57.000444  0.75"""
s2 = pd.read_csv(StringIO(txt2), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None) \
       .squeeze('columns').rename('s2').rename_axis(None)
TL;DR
df = pd.concat([s1, s2], axis=1).ffill().dropna()
overlap = df.index.to_series().diff().shift(-1) \
            .fillna(pd.Timedelta(0)).groupby(df.s1.eq(df.s2)).sum()
overlap.div(overlap.sum())
False 0.666657
True 0.333343
Name: duration, dtype: float64
explanation
build the base pd.DataFrame df:
- pd.concat to align the indexes
- ffill to let values propagate forward
- dropna to get rid of values of one series prior to the other starting
df = pd.concat([s1, s2], axis=1).ffill().dropna()
df
calculate 'duration'
from current time stamp to next
df['duration'] = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
df
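The duration step can be seen on a toy index (timestamps are my own): diff() measures the gap from the previous timestamp, and shift(-1) moves each gap up one row so each row holds the gap to the next timestamp.

```python
import pandas as pd

idx = pd.to_datetime(["2016-10-05 11:50:00",
                      "2016-10-05 11:50:03",
                      "2016-10-05 11:50:07"])

# diff(): [NaT, 3s, 4s]; shift(-1): [3s, 4s, NaT]; the trailing NaT
# (no "next" timestamp) becomes a zero-length duration
dur = idx.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
print(dur.dt.total_seconds().tolist())  # [3.0, 4.0, 0.0]
```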
calculate overlap
- df.s1.eq(df.s2) gives a boolean series of when s1 overlaps with s2
- groupby that boolean series to aggregate the total duration when True and when False
overlap = df.groupby(df.s1.eq(df.s2)).duration.sum()
overlap
False 00:00:33.999548
True 00:00:17.000521
Name: duration, dtype: timedelta64[ns]
percentage of time with same value
overlap.div(overlap.sum())
False 0.666657
True 0.333343
Name: duration, dtype: float64
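Since the question mentions running many pairs, the whole recipe above can be wrapped in a small helper (the function name and the NaN guard are my additions):

```python
import pandas as pd

def overlap_pct(s1, s2):
    """Fraction of the observed window during which two step series
    hold the same value, weighting each row by how long it lasts."""
    df = pd.concat([s1, s2], axis=1, keys=["s1", "s2"]).ffill().dropna()
    dur = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
    total = dur.sum()
    if total == pd.Timedelta(0):
        return float("nan")  # fewer than two overlapping observations
    return dur[df["s1"].eq(df["s2"])].sum() / total
```

With the s1 and s2 from the setup above, overlap_pct(s1, s2) should return roughly 0.3333, matching the True row of the groupby output.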
Cool problem. I brute-forced this without using pandas or numpy vectorization, but I got your answer (thanks for working it out). I have not tested it on anything else. It only passes through each dataframe once, so it should scale linearly, but since it does no vectorization I don't know how fast it is in practice.
import pandas as pd
#############################################################################
#Preparing the dataframes
times_1 = ["2016-10-05 11:50:02.000734","2016-10-05 11:50:03.000033",
"2016-10-05 11:50:10.000479","2016-10-05 11:50:15.000234",
"2016-10-05 11:50:37.000199","2016-10-05 11:50:49.000401",
"2016-10-05 11:50:51.000362","2016-10-05 11:50:53.000424",
"2016-10-05 11:50:53.000982","2016-10-05 11:50:58.000606"]
times_1 = [pd.Timestamp(t) for t in times_1]
vals_1 = [0.50,0.25,0.50,0.25,0.50,0.50,0.25,0.75,0.25,0.75]
times_2 = ["2016-10-05 11:50:07.000537","2016-10-05 11:50:11.000994",
"2016-10-05 11:50:19.000181","2016-10-05 11:50:35.000578",
"2016-10-05 11:50:46.000761","2016-10-05 11:50:49.000295",
"2016-10-05 11:50:51.000835","2016-10-05 11:50:55.000792",
"2016-10-05 11:50:55.000904","2016-10-05 11:50:57.000444"]
times_2 = [pd.Timestamp(t) for t in times_2]
vals_2 = [0.50,0.50,0.50,0.50,0.50,0.75,0.75,0.25,0.75,0.75]
data_1 = pd.DataFrame({"time":times_1,"vals":vals_1})
data_2 = pd.DataFrame({"time":times_2,"vals":vals_2})
#############################################################################
shared_time = 0 #Keep running tally of shared time
t1_ind = 0 #Pointer to row in data_1 dataframe
t2_ind = 0 #Pointer to row in data_2 dataframe
#Loop through both dataframes once, incrementing either the t1 or t2 index
#Stop one before the end of both since do +1 indexing in loop
while t1_ind < len(data_1.time)-1 and t2_ind < len(data_2.time)-1:
    #Get val1 and val2
    val1, val2 = data_1.vals[t1_ind], data_2.vals[t2_ind]
    #Get the start and stop of the current time window
    t1_start, t1_stop = data_1.time[t1_ind], data_1.time[t1_ind+1]
    t2_start, t2_stop = data_2.time[t2_ind], data_2.time[t2_ind+1]
    #If the values match and the start of time window 2 is in time window 1
    if val1 == val2 and (t1_start <= t2_start <= t1_stop):
        shared_time += (min(t1_stop, t2_stop) - t2_start).total_seconds()
        t1_ind += 1
    #If the values match and the start of time window 1 is in time window 2
    elif val1 == val2 and t2_start <= t1_start <= t2_stop:
        shared_time += (min(t1_stop, t2_stop) - t1_start).total_seconds()
        t2_ind += 1
    #Otherwise advance whichever window starts first
    elif t1_start < t2_start:
        t1_ind += 1
    else:
        t2_ind += 1
#How I calculated the maximum possible shared time (not pretty)
shared_start = max(data_1.time[0],data_2.time[0])
shared_stop = min(data_1.time.iloc[-1],data_2.time.iloc[-1])
max_possible_shared = (shared_stop-shared_start).total_seconds()
#Print output
print("Shared time:", shared_time)
print("Total possible shared:", max_possible_shared)
print("Percent shared:", shared_time * 100 / max_possible_shared, "%")
Output:
Shared time: 17.000521
Total possible shared: 49.999907
Percent shared: 34.0011052421 %