
grouping events in a time series with R

Tags:

r

time-series

I've been doing some logging to show Comcast Business how often their service drops out at my office. I log ping response times to a file, then parse that file with R. My script logs a ping every 5 seconds, and a value of 1000 in the log means the ping timed out. So if my Comcast service is down for 30 seconds, that results in ~6 log entries with a value of 1000. I'd like to parse my logs so that I can create a summary table showing when each outage started and how long it lasted. What are some good ways to do this?

Here's some example data from today and some graphs that illustrate my time series:

require(xts)
outFile <- "http://pastebin.com/raw.php?i=SJuMQ9rD"
pingLog <- read.csv(outFile, header=FALSE, 
     col.names = c("time","ms"), 
     colClasses=c("POSIXct", "numeric"))
xPingLog <- as.xts(pingLog$ms, order.by=pingLog$time)
outages <- subset(pingLog, ms==1000)
xOutages <- as.xts(outages$ms, order.by=outages$time)

par(mfrow=c(2,1))
plot(xPingLog)
plot(outages)
outages
JD Long asked Feb 03 '23 11:02


1 Answer

You've got to love run-length encoding, available in base R as rle:

offline <- ifelse(pingLog$ms==1000, TRUE, FALSE)
rleOffline <- rle(offline)

offlineTable <- data.frame(
    endtime = pingLog$time[cumsum(rleOffline$lengths)],
    duration = rleOffline$lengths * 5,
    offline = rleOffline$values
)

Results in:

offlineTable

              endtime duration offline
1 2011-11-20 13:20:19     1030   FALSE
2 2011-11-20 13:20:35        5    TRUE
3 2011-11-20 13:24:37      240   FALSE
4 2011-11-20 13:25:57       25    TRUE
5 2011-11-20 13:53:28     1640   FALSE

Why does this work?

First, construct a logical vector that indicates online vs. offline. ifelse is handy for this, although the comparison pingLog$ms==1000 on its own already returns the same logical vector.

offline <- ifelse(pingLog$ms==1000, TRUE, FALSE)

Then use rle to calculate the run length encoding:

rle(offline)
Run Length Encoding
  lengths: int [1:5] 206 1 48 5 328
  values : logi [1:5] FALSE TRUE FALSE TRUE FALSE

This output tells you how many runs of either TRUE or FALSE occurred, and how long each run was. In this case, the first run was 206 periods with a value of FALSE (i.e. online for 206*5=1030 seconds).
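To see what rle returns without downloading the ping log, here is a minimal sketch on a made-up vector. The name `status` and its values are invented for illustration and are not taken from the log above.

```r
# Hypothetical status vector: TRUE marks a timed-out ping
status <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

runs <- rle(status)

runs$lengths  # 3 2 1 1 2
runs$values   # FALSE TRUE FALSE TRUE FALSE

# inverse.rle() rebuilds the original vector from the encoding
identical(inverse.rle(runs), status)  # TRUE
```

Each element of `lengths` pairs with the element of `values` at the same position, so the second run here is two consecutive timeouts.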

The final step is to use the rle information to index into the original pingLog and find the times. The extra bit of magic is cumsum, which calculates the cumulative sum of the run lengths. In real-world terms, each cumulative sum is the index position in pingLog where a run ends.
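Since the question asks when each outage started, the same cumsum idea can be turned around: a run starts at its end index minus its length, plus one. The following is only a sketch on invented data, not part of the original answer. The `toyLog` data frame, its timestamps, and the names `startIdx`, `runTable` and `outageTable` are hypothetical, and it assumes exactly one reading every 5 seconds.

```r
# Hypothetical log: one reading every 5 seconds, 1000 ms means a timeout
toyLog <- data.frame(
    time = as.POSIXct("2011-11-20 13:00:00", tz = "UTC") + 5 * (0:8),
    ms   = c(20, 25, 1000, 1000, 1000, 30, 1000, 22, 21)
)

runs     <- rle(toyLog$ms == 1000)
endIdx   <- cumsum(runs$lengths)        # index where each run ends
startIdx <- endIdx - runs$lengths + 1   # index where each run starts

runTable <- data.frame(
    starttime = toyLog$time[startIdx],
    endtime   = toyLog$time[endIdx],
    duration  = runs$lengths * 5,       # seconds, assuming 5 s sampling
    offline   = runs$values
)

# Keep only the outage runs
outageTable <- runTable[runTable$offline, ]
outageTable
```

With this toy data, outageTable has two rows: a 15-second outage starting at 13:00:10 and a 5-second outage starting at 13:00:30.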

Andrie answered Feb 05 '23 02:02