Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using dplyr lead but with some contraints

Tags:

r

dplyr

I have this data frame, dat, and dplyr is used to add the "NextStatTime" field which is the start time after the End time for an ID and "Duration" which is the time from the End Time to the next start time for an ID.

The data looks like this:

     dat = data.frame(ID= c(1,1,1,2,3,3),
                      NumberInSequence= c(1,3,4,1,1,2),
                      StartTime = as.POSIXct(c("2016-01-01 05:52:05 GMT","2016-01-01 05:52:11 GMT","2016-01-01 05:52:16 GMT","2016-01-01 05:40:05 GMT","2016-01-01 06:12:13 GMT","2016-01-01 07:12:26 GMT"))  ,
                      EndTime = as.POSIXct(c("2016-01-01 05:52:10 GMT","2016-01-01 05:52:16 GMT","2016-01-01 05:52:30 GMT","2016-01-01 05:46:05 GMT","2016-01-01 06:12:25 GMT","2016-01-01 08:00:00 GMT")  )
                       )

    dat
    dat %>% group_by(ID) %>% mutate(NextStartTime = lead(StartTime), duration = as.numeric(difftime(NextStartTime, EndTime, units = 's')))

  ID NumberInSequence           StartTime             EndTime       NextStartTime duration
  <dbl>            <dbl>              <time>              <time>              <time>    <dbl>
1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10 2016-01-01 05:52:11        1
2     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16        0
3     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>       NA
4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>       NA
5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26     3601
6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>       NA

That is very close to the right answer but if there is a missing ID it still calculates and is misleading.

For example - look at ID= 1 there are 3 entries with sequence numbers 1,3 and 4. There is no #2 in the sequence. It is missing so the NextStartTime and Duration for ID = 1 and Number in sequence = 1 should be NA NOT 05:52:11 and 1.

Is there a way to impose this logic?

Thank you.

like image 396
user3022875 Avatar asked Oct 18 '22 01:10

user3022875


2 Answers

I'd like to leave an alternative way using POSIXct. You can avoid the numeric-to-date conversion in the following way. If you use if_else() in the dplyr package, it cares about class. What you want to do is to keep the POSIXct class in the function. If you type typeof(dat$StartTime), you will realize that StartTime is in double. Its class is POSIXct. So you want to use NA_real_ and treat the NAs as POSIXct. Please make sure that you provide origin and your time zone. In my case, I needed to specify my time zone as "Asia/Tokyo" in order or produce the expected outcome.

library(dplyr)
group_by(dat, ID) %>%
mutate(NextStartTime = if_else(abs(NumberInSequence - lead(NumberInSequence)) != 1,
                               true = as.POSIXct(NA_real_, origin = "1970-01-01 00:00:00", tz = "Asia/Tokyo"),
                               false = lead(StartTime)),
       Duration = difftime(NextStartTime, EndTime, unit = "s"))


#     ID NumberInSequence           StartTime             EndTime       NextStartTime  Duration
#  <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <time>
#1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10                <NA>   NA secs
#2     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16    0 secs
#3     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>   NA secs
#4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>   NA secs
#5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601 secs
#6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>   NA secs
like image 126
jazzurro Avatar answered Oct 21 '22 06:10

jazzurro


Two options:


tidyr::complete

One option is to use tidyr::complete to fill in the missing rows, and use the previous method.

Downside: You get new mostly NA rows added. You could omit them after the fact with a careful filter call, though.
Upside: It's easy to write and understand, and preserves the original logic.

library(tidyverse)

dat %>% group_by(ID) %>% 
    complete(NumberInSequence = seq(max(NumberInSequence))) %>% 
    mutate(NextStartTime = lead(StartTime), 
           Duration = as.numeric(difftime(NextStartTime, EndTime, units = 's')))

## Source: local data frame [7 x 6]
## Groups: ID [3]
## 
##      ID NumberInSequence           StartTime             EndTime       NextStartTime Duration
##   <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <dbl>
## 1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10                <NA>       NA
## 2     1                2                <NA>                <NA> 2016-01-01 05:52:11       NA
## 3     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16        0
## 4     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>       NA
## 5     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>       NA
## 6     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26     3601
## 7     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>       NA

subset lead(StartTime) with ifelse

ifelse inconveniently strips attributes, so you can't do ifelse(lead(StartTime) == NumberInSequence + 1, lead(StartTime), NA) without reconverting the resulting integer back to POSIXct, which is a hassle. Instead, it's easier to subset with ifelse, passing an NA if it's not a match, so the vector indexed returns NA instead of nothing.

Downside: It's finicky to write in order to keep types.
Upside: No additional rows are added.

dat %>% group_by(ID) %>% 
    mutate(NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)], 
           duration = difftime(NextStartTime, EndTime, units = 's'))

## Source: local data frame [6 x 6]
## Groups: ID [3]
## 
##      ID NumberInSequence           StartTime             EndTime       NextStartTime  duration
##   <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <time>
## 1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10                <NA>   NA secs
## 2     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16    0 secs
## 3     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>   NA secs
## 4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>   NA secs
## 5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601 secs
## 6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>   NA secs
like image 33
alistaire Avatar answered Oct 21 '22 05:10

alistaire