The underlying dataset is generated by sensors. Every 6 seconds, each sensor sends a signal identifying all the people (who carry fobs) in range. Ignoring the people, typical data looks like this:
SensorID           timestamp
       2 2015-08-04 09:56:32
       2 2015-08-04 09:56:38
       2 2015-08-05 18:45:20
       3 2015-08-04 09:54:33
       3 2015-08-04 09:54:39
       3 2015-08-04 09:57:31
       3 2015-08-04 09:58:09
       3 2015-08-04 09:58:15
       3 2015-08-04 09:58:33
       3 2015-08-04 09:58:39
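(For reference, here is a minimal snippet to rebuild this sample as a data frame df; the column names are taken from the listing above, and the timestamps are parsed with as.POSIXct:)

df <- data.frame(
  SensorID  = c(2, 2, 2, 3, 3, 3, 3, 3, 3, 3),
  timestamp = as.POSIXct(c(
    "2015-08-04 09:56:32", "2015-08-04 09:56:38", "2015-08-05 18:45:20",
    "2015-08-04 09:54:33", "2015-08-04 09:54:39", "2015-08-04 09:57:31",
    "2015-08-04 09:58:09", "2015-08-04 09:58:15", "2015-08-04 09:58:33",
    "2015-08-04 09:58:39"))
)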
I would like to convert this into events with start and end times where consecutive signals from the same sensor (and fob) are considered to be part of the same event if they are less than 60 seconds apart.
So the above test data would be transformed to:
SensorID           startTime             endTime sensorCount duration
       2 2015-08-04 09:56:32 2015-08-04 09:56:38           2   6 secs
       2 2015-08-05 18:45:20 2015-08-05 18:45:20           1   0 secs
       3 2015-08-04 09:54:33 2015-08-04 09:54:39           2   6 secs
       3 2015-08-04 09:57:31 2015-08-04 09:58:39           5  68 secs
I have code that works:
# identify the ends of sequences (the last row always ends one)
n <- nrow(df)
lastKeep <- c(df$SensorID[-n] != df$SensorID[-1L] |
                difftime(df$timestamp[-1L], df$timestamp[-n], units = "secs") > 60,
              TRUE)
# set startTime and cumulative time and number of signals
df$startTime <- df$timestamp
df$endTime <- df$timestamp
df$sensorCount <- 1
for (jj in 2:n) {
  if (!lastKeep[jj - 1]) {
    # still in the same sequence: carry the start time forward and count up
    df$startTime[jj] <- df$startTime[jj - 1]
    df$sensorCount[jj] <- df$sensorCount[jj - 1] + 1
  }
}
# keep only the last record of each sequence and compute its duration
df <- df[lastKeep, ]
df$duration <- difftime(df$endTime, df$startTime, units = "secs")
df$timestamp <- NULL
However, this code takes several seconds for my actual test data of 2,000 records, and the full dataset is already 6.5 million records and still being collected. I therefore need something efficient.
Is there a way to vectorise this despite its reliance on the 'previous' record to provide the cumulative time and signal count?
My current plan is to use Rcpp, but my C++ skills are mediocre at best. Alternatively, is there an R package that could collapse consecutive signal records? I couldn't find one in the time-series or signal-processing world, but those are not my fields, so I may have missed something obvious.
Here's a possible data.table solution using the development version from GitHub; it should be efficient enough.
library(data.table) # v1.9.5+ (for shift())
setDT(df)[, timestamp := as.POSIXct(timestamp)] # make sure it's a valid POSIXct class
df[, .(
    startTime   = timestamp[1L],
    endTime     = timestamp[.N],
    sensorCount = .N,
    duration    = difftime(timestamp[.N], timestamp[1L], units = "secs")
  ),
  by = .(SensorID,
         cumsum(difftime(timestamp, shift(timestamp, fill = timestamp[1L]),
                         units = "secs") > 60))]
#    SensorID cumsum           startTime             endTime sensorCount duration
# 1:        2      0 2015-08-04 09:56:32 2015-08-04 09:56:38           2   6 secs
# 2:        2      1 2015-08-05 18:45:20 2015-08-05 18:45:20           1   0 secs
# 3:        3      1 2015-08-04 09:54:33 2015-08-04 09:54:39           2   6 secs
# 4:        3      2 2015-08-04 09:57:31 2015-08-04 09:58:39           5  68 secs
The idea here is to group by SensorID together with the cumulative sum of time gaps greater than 60 seconds: each such gap starts a new group. Within each group we then take the first timestamp as startTime, the last as endTime, the group size as sensorCount, and the difference between the last and first timestamps as duration.
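To see the grouping trick in isolation, here is a small base R sketch that reproduces those group ids from the sample data (note that the by expression is evaluated over the whole table before splitting, which is why sensor 3's first group id is 1 rather than 0):

ts <- df$timestamp  # the 10 sample timestamps, in the order shown above
# gap to the previous signal in seconds; the first gap is defined as 0,
# matching shift(timestamp, fill = timestamp[1L])
gaps <- c(0, diff(as.numeric(ts)))
# every gap over 60 secs starts a new group, so the running count of such
# gaps is a group id; paired with SensorID it identifies each event
cumsum(gaps > 60)
# [1] 0 0 1 1 1 2 2 2 2 2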