I have this data frame, dat, and dplyr is used to add the "NextStatTime" field which is the start time after the End time for an ID and "Duration" which is the time from the End Time to the next start time for an ID.
The data looks like this:
dat = data.frame(ID= c(1,1,1,2,3,3),
NumberInSequence= c(1,3,4,1,1,2),
StartTime = as.POSIXct(c("2016-01-01 05:52:05 GMT","2016-01-01 05:52:11 GMT","2016-01-01 05:52:16 GMT","2016-01-01 05:40:05 GMT","2016-01-01 06:12:13 GMT","2016-01-01 07:12:26 GMT")) ,
EndTime = as.POSIXct(c("2016-01-01 05:52:10 GMT","2016-01-01 05:52:16 GMT","2016-01-01 05:52:30 GMT","2016-01-01 05:46:05 GMT","2016-01-01 06:12:25 GMT","2016-01-01 08:00:00 GMT") )
)
dat
dat %>% group_by(ID) %>% mutate(NextStartTime = lead(StartTime), duration = as.numeric(difftime(NextStartTime, EndTime, units = 's')))
ID NumberInSequence StartTime EndTime NextStartTime duration
<dbl> <dbl> <time> <time> <time> <dbl>
1 1 1 2016-01-01 05:52:05 2016-01-01 05:52:10 2016-01-01 05:52:11 1
2 1 3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16 0
3 1 4 2016-01-01 05:52:16 2016-01-01 05:52:30 <NA> NA
4 2 1 2016-01-01 05:40:05 2016-01-01 05:46:05 <NA> NA
5 3 1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601
6 3 2 2016-01-01 07:12:26 2016-01-01 08:00:00 <NA> NA
That is very close to the right answer but if there is a missing ID it still calculates and is misleading.
For example - look at ID= 1 there are 3 entries with sequence numbers 1,3 and 4. There is no #2 in the sequence. It is missing so the NextStartTime and Duration for ID = 1 and Number in sequence = 1 should be NA NOT 05:52:11 and 1.
Is there a way to impose this logic?
Thank you.
I'd like to leave an alternative way using POSIXct
. You can avoid the numeric-to-date conversion in the following way. If you use if_else()
in the dplyr package, it cares about class. What you want to do is to keep the POSIXct
class in the function. If you type typeof(dat$StartTime)
, you will realize that StartTime
is in double. Its class is POSIXct. So you want to use NA_real_
and treat the NAs as POSIXct. Please make sure that you provide origin and your time zone. In my case, I needed to specify my time zone as "Asia/Tokyo" in order or produce the expected outcome.
library(dplyr)
group_by(dat, ID) %>%
mutate(NextStartTime = if_else(abs(NumberInSequence - lead(NumberInSequence)) != 1,
true = as.POSIXct(NA_real_, origin = "1970-01-01 00:00:00", tz = "Asia/Tokyo"),
false = lead(StartTime)),
Duration = difftime(NextStartTime, EndTime, unit = "s"))
# ID NumberInSequence StartTime EndTime NextStartTime Duration
# <dbl> <dbl> <dttm> <dttm> <dttm> <time>
#1 1 1 2016-01-01 05:52:05 2016-01-01 05:52:10 <NA> NA secs
#2 1 3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16 0 secs
#3 1 4 2016-01-01 05:52:16 2016-01-01 05:52:30 <NA> NA secs
#4 2 1 2016-01-01 05:40:05 2016-01-01 05:46:05 <NA> NA secs
#5 3 1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601 secs
#6 3 2 2016-01-01 07:12:26 2016-01-01 08:00:00 <NA> NA secs
Two options:
tidyr::complete
One option is to use tidyr::complete
to fill in the missing rows, and use the previous method.
Downside: You get new mostly NA
rows added. You could omit them after the fact with a careful filter
call, though.
Upside: It's easy to write and understand, and preserves the original logic.
library(tidyverse)
dat %>% group_by(ID) %>%
complete(NumberInSequence = seq(max(NumberInSequence))) %>%
mutate(NextStartTime = lead(StartTime),
Duration = as.numeric(difftime(NextStartTime, EndTime, units = 's')))
## Source: local data frame [7 x 6]
## Groups: ID [3]
##
## ID NumberInSequence StartTime EndTime NextStartTime Duration
## <dbl> <dbl> <dttm> <dttm> <dttm> <dbl>
## 1 1 1 2016-01-01 05:52:05 2016-01-01 05:52:10 <NA> NA
## 2 1 2 <NA> <NA> 2016-01-01 05:52:11 NA
## 3 1 3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16 0
## 4 1 4 2016-01-01 05:52:16 2016-01-01 05:52:30 <NA> NA
## 5 2 1 2016-01-01 05:40:05 2016-01-01 05:46:05 <NA> NA
## 6 3 1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601
## 7 3 2 2016-01-01 07:12:26 2016-01-01 08:00:00 <NA> NA
lead(StartTime)
with ifelse
ifelse
inconveniently strips attributes, so you can't do ifelse(lead(StartTime) == NumberInSequence + 1, lead(StartTime), NA)
without reconverting the resulting integer back to POSIXct, which is a hassle. Instead, it's easier to subset with ifelse
, passing an NA
if it's not a match, so the vector indexed returns NA
instead of nothing.
Downside: It's finicky to write in order to keep types.
Upside: No additional rows are added.
dat %>% group_by(ID) %>%
mutate(NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)],
duration = difftime(NextStartTime, EndTime, units = 's'))
## Source: local data frame [6 x 6]
## Groups: ID [3]
##
## ID NumberInSequence StartTime EndTime NextStartTime duration
## <dbl> <dbl> <dttm> <dttm> <dttm> <time>
## 1 1 1 2016-01-01 05:52:05 2016-01-01 05:52:10 <NA> NA secs
## 2 1 3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16 0 secs
## 3 1 4 2016-01-01 05:52:16 2016-01-01 05:52:30 <NA> NA secs
## 4 2 1 2016-01-01 05:40:05 2016-01-01 05:46:05 <NA> NA secs
## 5 3 1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601 secs
## 6 3 2 2016-01-01 07:12:26 2016-01-01 08:00:00 <NA> NA secs
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With