
Finding Time Difference Between Observations in R

Tags: datetime, r

I'm trying to determine the time difference between two observations. The data is broken up by individuals, each of whom has a unique ID. I have a dataset which records the new status every time an individual's status changes, along with the time of the change. Status can be one of two values, and it always changes to the value it is not (in this case, from Y to N, or N to Y).

The data looks like this:

ID Status Time
1    Y     2013-07-01 08:07:00      
2    Y     2013-07-01 08:07:03  
3    Y     2013-07-01 08:07:04      
4    Y     2013-07-01 08:07:06      
1    N     2013-07-01 08:07:07      
2    N     2013-07-01 08:07:23      
5    Y     2013-07-01 08:07:34  
6    Y     2013-07-01 08:07:45  
7    Y     2013-07-01 08:07:47  
1    Y     2013-07-01 08:07:56  
3    N     2013-07-01 08:07:58  

What I would like to find is the amount of time which passes between each status change for each individual ID, that is, how long it takes to get from Y to N. I then want summary statistics such as the distribution of elapsed times, the mean elapsed time, etc.

So an example output might look like this, recording the three Y to N switches which occurred above (IDs 1, 2, and 3 each switched):

Y to N change    Time elapsed (in seconds)
1                     7 
2                     20
3                     54

I'm having a lot of trouble with this for some reason. Right now I have the time in POSIXlt format, and the ID and status as a factor. I have tried using ddply to sort the data by ID and then by timestamp, but this hasn't worked so far. Any advice would be much appreciated!

Edit: changed Time to actually be in the correct type.

Edit2: ended up writing a solution while waiting for more answers. My way is much uglier than many of the solutions here, but I did:

N <- ifelse(df$Status == "N", 1, 0)
Y <- ifelse(df$Status == "Y", 1, 0)

# making a vector which is 1 for a row if the status of the row below it is N
var1 <- N
for (i in 1:nrow(df)) {
  var1[i] <- N[i+1]
}

# making a vector which is TRUE if a row's status is Y and the row after is N
check <- ifelse(Y == 1 & var1 == 1, TRUE, FALSE)
# had to define the last one as FALSE manually because N[i+1] does not exist for the last row, so the for loop above leaves NA there
check[nrow(df)] <- FALSE

# made a loop which finds the time difference between a row's Time and the row below it, given that "check" is TRUE for that row, and writes that to a results vector.
# here is the results vector
results <- numeric(nrow(df))
# here is the for loop
for (i in 1:nrow(df)) {
  if (check[i]) {
    # later time first so the elapsed time is positive; units fixed to seconds
    results[i] <- as.numeric(difftime(df$Time[i+1], df$Time[i], units = "secs"))
  }
}

I originally had this solved with a for loop, but over the ~1 million rows of my actual dataset it was way too slow, so I did this vectorization stuff. Would these other solutions work on data that large? I will definitely be trying them out!
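For what it's worth, the two loops above can probably be dropped entirely by comparing each row with the row below it using shifted vectors. The following is only a minimal base R sketch, not a tested drop-in for the real dataset: it assumes the data is sorted by ID and then by Time, it rebuilds the sample data from the question, and it assumes a UTC time zone for the timestamps. The names n, same_id and y_to_n are made up for illustration.

```r
# Sample data from the question (time zone is an assumption)
df <- data.frame(
  ID     = c(1, 2, 3, 4, 1, 2, 5, 6, 7, 1, 3),
  Status = c("Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "N"),
  Time   = as.POSIXct(c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                        "2013-07-01 08:07:04", "2013-07-01 08:07:06",
                        "2013-07-01 08:07:07", "2013-07-01 08:07:23",
                        "2013-07-01 08:07:34", "2013-07-01 08:07:45",
                        "2013-07-01 08:07:47", "2013-07-01 08:07:56",
                        "2013-07-01 08:07:58"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort so that each ID's rows are adjacent and in time order
df <- df[order(df$ID, df$Time), ]
n  <- nrow(df)

# Compare rows 1..(n-1) with rows 2..n in one vectorized step
same_id <- df$ID[-n] == df$ID[-1]
y_to_n  <- df$Status[-n] == "Y" & df$Status[-1] == "N"

# Seconds between each row and the row below it
gaps <- as.numeric(difftime(df$Time[-1], df$Time[-n], units = "secs"))

# Keep only the Y-to-N switches that stay within one ID
results <- gaps[same_id & y_to_n]

results        # elapsed seconds for each Y-to-N switch
mean(results)  # summary statistics can be taken directly from the vector
summary(results)
```

Unlike the loop version, this also checks that both rows belong to the same ID, so a Y at the end of one individual's rows is not paired with an N at the start of the next individual's rows.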

verybadatthis asked Nov 03 '14 22:11


2 Answers

Here is another approach. I tried to keep all of the data in the final output. Please note that, for demonstration purposes, I modified your data a bit. In my code, I first arrange the data by ID and Time. I then recode Status (Y and N) as 1 and 0 in order to create group, which tells us when Status changed: if the same group number runs over several rows, Status did not change across them. Next, I calculate the time difference between consecutive rows (gap) for each ID. Finally, I set gap to NA on every row that is not the first row of its group, which removes the unnecessary gaps. Please note that the first observation for each ID has NA in gap as well. gap is in seconds.

ann <- data.frame(ID = c(1,2,3,4,1,2,2,1,1,1,3),
                  Status = c("Y", "Y", "Y", "Y",
                             "N", "N", "Y", "Y", "Y", "N", "N"),
                  Time = c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                           "2013-07-01 08:07:04", "2013-07-01 08:07:06",
                           "2013-07-01 08:07:07", "2013-07-01 08:07:23",
                           "2013-07-01 08:07:34", "2013-07-01 08:07:45",
                           "2013-07-01 08:07:47", "2013-07-01 08:07:56",
                           "2013-07-01 08:07:58"),
                  stringsAsFactors = FALSE)

ann$Time <- as.POSIXct(ann$Time)

#   ID Status                Time
#1   1      Y 2013-07-01 08:07:00
#2   2      Y 2013-07-01 08:07:03
#3   3      Y 2013-07-01 08:07:04
#4   4      Y 2013-07-01 08:07:06
#5   1      N 2013-07-01 08:07:07
#6   2      N 2013-07-01 08:07:23
#7   2      Y 2013-07-01 08:07:34
#8   1      Y 2013-07-01 08:07:45
#9   1      Y 2013-07-01 08:07:47
#10  1      N 2013-07-01 08:07:56
#11  3      N 2013-07-01 08:07:58

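library(dplyr)

ann %>%
    arrange(ID, Time) %>%
    group_by(ID) %>%
    mutate(Status = ifelse(Status == "Y", 1, 0),
           # group increases by one each time Status changes within an ID
           group = cumsum(c(TRUE, diff(Status) != 0)),
           gap = Time - lag(Time)) %>%
    group_by(ID, group) %>%
    # keep gap only on the first row of each run of identical Status
    mutate(gap = ifelse(row_number() != 1, NA, gap))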

#   ID Status                Time group gap
#1   1      1 2013-07-01 08:07:00     1  NA
#2   1      0 2013-07-01 08:07:07     2   7
#3   1      1 2013-07-01 08:07:45     3  38
#4   1      1 2013-07-01 08:07:47     3  NA
#5   1      0 2013-07-01 08:07:56     4   9
#6   2      1 2013-07-01 08:07:03     1  NA
#7   2      0 2013-07-01 08:07:23     2  20
#8   2      1 2013-07-01 08:07:34     3  11
#9   3      1 2013-07-01 08:07:04     1  NA
#10  3      0 2013-07-01 08:07:58     2  54
#11  4      1 2013-07-01 08:07:06     1  NA
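If only the Y to N switches are wanted for the summary statistics, one possible follow-up is to compare each row's Status with the previous row's Status inside each ID and keep the rows where N follows Y. This is a rough sketch rather than part of the approach above: it uses the same modified data, assumes a UTC time zone, and the names prev_status and y_to_n are invented for illustration. It measures from the row immediately before each N, so a run of repeated Y rows counts from the last Y in the run.

```r
library(dplyr)

# Same modified data as above (time zone is an assumption)
ann <- data.frame(ID = c(1, 2, 3, 4, 1, 2, 2, 1, 1, 1, 3),
                  Status = c("Y", "Y", "Y", "Y",
                             "N", "N", "Y", "Y", "Y", "N", "N"),
                  Time = c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                           "2013-07-01 08:07:04", "2013-07-01 08:07:06",
                           "2013-07-01 08:07:07", "2013-07-01 08:07:23",
                           "2013-07-01 08:07:34", "2013-07-01 08:07:45",
                           "2013-07-01 08:07:47", "2013-07-01 08:07:56",
                           "2013-07-01 08:07:58"),
                  stringsAsFactors = FALSE)
ann$Time <- as.POSIXct(ann$Time, tz = "UTC")

y_to_n <- ann %>%
    arrange(ID, Time) %>%
    group_by(ID) %>%
    mutate(prev_status = lag(Status),
           # explicit units so the result is always in seconds
           gap = as.numeric(difftime(Time, lag(Time), units = "secs"))) %>%
    ungroup() %>%
    # rows where this status is N and the previous status for the same ID was Y
    filter(Status == "N", prev_status == "Y") %>%
    select(ID, Time, gap)

y_to_n
mean(y_to_n$gap)
summary(y_to_n$gap)
```

Fixing the units with difftime() avoids the automatic unit choice that plain subtraction of date-times makes, which could otherwise switch to minutes or hours on longer gaps.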
jazzurro answered Nov 15 '22 03:11


This seems to work on the sample data you provided, although the times in it (reproduced below) are plain numbers, not POSIXlt. For each ID, this finds the first Y time and the first N time, drops any ID that does not have both a Y and an N, and subtracts the first Y time from the first N time.

library('dplyr')

df <- read.table(text = "ID Status Time
1    Y     1
2    Y     2
3    Y     3.5
4    Y     4
1    N     5.8
2    N     6
5    Y     7
6    Y     8
7    Y     8.1
1    Y     11
3    N     12", header = TRUE)
df$ID <- as.factor(df$ID) # convert ID to factor

df %>%
  group_by(ID, Status) %>%
  summarize(Time = min(Time)) %>%
  filter("N" %in% Status & "Y" %in% Status) %>%
  summarize(Time_elapsed = Time[Status == "N"] - Time[Status == "Y"])

Result:

  ID Time_elapsed
1  1          4.8
2  2          4.0
3  3          8.5
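The same pipeline should carry over to real timestamps if Time is stored as POSIXct. Below is a hedged sketch using the date-times from the question; the UTC time zone, the explicit conversion to seconds with difftime(), and the name elapsed are my assumptions. Note that it still only looks at the first Y and the first N per ID, so later switches for the same ID are ignored, and an ID whose first N comes before its first Y would yield a negative value.

```r
library(dplyr)

# Data from the question with real timestamps (time zone is an assumption)
df <- data.frame(
  ID     = c(1, 2, 3, 4, 1, 2, 5, 6, 7, 1, 3),
  Status = c("Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "N"),
  Time   = as.POSIXct(c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                        "2013-07-01 08:07:04", "2013-07-01 08:07:06",
                        "2013-07-01 08:07:07", "2013-07-01 08:07:23",
                        "2013-07-01 08:07:34", "2013-07-01 08:07:45",
                        "2013-07-01 08:07:47", "2013-07-01 08:07:56",
                        "2013-07-01 08:07:58"), tz = "UTC"),
  stringsAsFactors = FALSE
)

elapsed <- df %>%
  group_by(ID, Status) %>%
  # earliest time per ID and Status; keep the result grouped by ID
  summarize(Time = min(Time), .groups = "drop_last") %>%
  # keep only IDs that have both a Y and an N
  filter("N" %in% Status & "Y" %in% Status) %>%
  summarize(Time_elapsed = as.numeric(
    difftime(Time[Status == "N"], Time[Status == "Y"], units = "secs")))

elapsed
mean(elapsed$Time_elapsed)
```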
Kara Woo answered Nov 15 '22 04:11