Efficiently combine timestamped sensor data into events in R

The underlying dataset is generated by sensors. Every 6 seconds, each sensor sends a signal identifying all the people (who have fobs) in range. Ignoring the people, typical data looks like this:

   SensorID   timestamp
     2        2015-08-04 09:56:32
     2        2015-08-04 09:56:38
     2        2015-08-05 18:45:20
     3        2015-08-04 09:54:33
     3        2015-08-04 09:54:39
     3        2015-08-04 09:57:31
     3        2015-08-04 09:58:09
     3        2015-08-04 09:58:15
     3        2015-08-04 09:58:33
     3        2015-08-04 09:58:39

I would like to convert this into events with start and end times where consecutive signals from the same sensor (and fob) are considered to be part of the same event if they are less than 60 seconds apart.

So the above test data would be transformed to:

SensorID  startTime              endTime              sensorCount   duration
2         2015-08-04 09:56:32    2015-08-04 09:56:38  2             6 secs
2         2015-08-05 18:45:20    2015-08-05 18:45:20  1             0 secs
3         2015-08-04 09:54:33    2015-08-04 09:54:39  2             6 secs
3         2015-08-04 09:57:31    2015-08-04 09:58:39  5             68 secs
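
For reproducibility, the test data above can be built like this (a sketch; it assumes the columns are named SensorID and timestamp, the timestamps are parsed as POSIXct, and the rows are sorted by SensorID and then timestamp):

df <- data.frame(
  SensorID  = c(2, 2, 2, 3, 3, 3, 3, 3, 3, 3),
  timestamp = as.POSIXct(c(
    "2015-08-04 09:56:32", "2015-08-04 09:56:38", "2015-08-05 18:45:20",
    "2015-08-04 09:54:33", "2015-08-04 09:54:39", "2015-08-04 09:57:31",
    "2015-08-04 09:58:09", "2015-08-04 09:58:15", "2015-08-04 09:58:33",
    "2015-08-04 09:58:39"
  ))
)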

I have code that works.

# identify the ends of sequences: the sensor changes or the next signal
# is more than 60 seconds away (the final row is always an end)
n <- nrow(df)
lastKeep <- c(df$SensorID[-n] != df$SensorID[-1L] |
                difftime(df$timestamp[-1L], df$timestamp[-n], units = "secs") > 60,
              TRUE)

# initialise startTime, endTime and the signal count
df$startTime <- df$timestamp
df$endTime <- df$timestamp
df$sensorCount <- 1

# carry the start time and cumulative count forward through each sequence
for (jj in 2:n) {
  if (!lastKeep[jj - 1]) {
    df$startTime[jj]   <- df$startTime[jj - 1]
    df$sensorCount[jj] <- df$sensorCount[jj - 1] + 1
  }
}

# keep only the last record of each sequence and compute the duration
df <- df[lastKeep, ]
df$duration <- difftime(df$endTime, df$startTime, units = "secs")
df$timestamp <- NULL

However, this code takes several seconds on my actual test data of 2,000 records, and the full dataset is already 6.5 million records and still being collected. I therefore need something efficient.

Is there a way to vectorise this despite its reliance on the 'previous' record to provide the cumulative time and signal count?

My current plan is to use Rcpp, but my C++ skills are mediocre at best. Alternatively, is there an R package that could collapse consecutive signal records? I couldn't find one in the time-series or signal processing world but they are not my fields so I may have missed something obvious.

Asked by JenB

1 Answer

Here's a possible data.table solution using the development version from GitHub; it should be efficient enough:

library(data.table) #V 1.9.5+
setDT(df)[, timestamp := as.POSIXct(timestamp)] # Make sure it's a valid POSIXct class
df[, .(
       startTime = timestamp[1L], 
       endTime = timestamp[.N], 
       sensorCount = .N,  
       duration = difftime(timestamp[.N], timestamp[1L], units = "secs")
      ), 
      by = .(SensorID, 
       cumsum(difftime(timestamp, shift(timestamp, fill = timestamp[1L]), units = "secs") > 60))]
#    SensorID cumsum           startTime             endTime sensorCount duration
# 1:        2      0 2015-08-04 09:56:32 2015-08-04 09:56:38           2   6 secs
# 2:        2      1 2015-08-05 18:45:20 2015-08-05 18:45:20           1   0 secs
# 3:        3      1 2015-08-04 09:54:33 2015-08-04 09:54:39           2   6 secs
# 4:        3      2 2015-08-04 09:57:31 2015-08-04 09:58:39           5  68 secs

The idea here is to group by the cumulative sum of time gaps of more than 60 seconds within each sensor, and then, for each group, take the first and last timestamps, the number of signals, and the difference between the first and last timestamps as the duration.
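
If data.table is not an option, the same cumulative-sum grouping can also be written in vectorised base R. A minimal sketch, assuming df$timestamp is already POSIXct and the rows are sorted by SensorID and then timestamp:

# flag the start of each event: sensor change or gap of more than 60 seconds
gap <- c(TRUE, df$SensorID[-1] != df$SensorID[-nrow(df)] |
               diff(as.numeric(df$timestamp)) > 60)
df$event <- cumsum(gap)   # one id per run of close signals

# one row per event: first/last timestamp, signal count and duration
res <- data.frame(
  SensorID    = tapply(df$SensorID, df$event, function(x) x[1]),
  startTime   = as.POSIXct(tapply(as.numeric(df$timestamp), df$event, min),
                           origin = "1970-01-01"),
  endTime     = as.POSIXct(tapply(as.numeric(df$timestamp), df$event, max),
                           origin = "1970-01-01"),
  sensorCount = as.vector(table(df$event))
)
res$duration <- difftime(res$endTime, res$startTime, units = "secs")

This removes the row-by-row loop, although on 6.5 million rows the data.table version above is still likely to be the faster of the two.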

Answered by David Arenburg