I'm trying to determine the time difference between two observations. The data is broken up by individuals, each of whom has a unique ID. My dataset records each individual's new status every time it changes, along with the time of the change. Status takes one of two values, so every change flips it to the other value (in this case, from Y to N, or N to Y).
The data looks like this:
ID Status Time
1 Y 2013-07-01 08:07:00
2 Y 2013-07-01 08:07:03
3 Y 2013-07-01 08:07:04
4 Y 2013-07-01 08:07:06
1 N 2013-07-01 08:07:07
2 N 2013-07-01 08:07:23
5 Y 2013-07-01 08:07:34
6 Y 2013-07-01 08:07:45
7 Y 2013-07-01 08:07:47
1 Y 2013-07-01 08:07:56
3 N 2013-07-01 08:07:58
What I would like to find is the amount of time which passes between each status change for each individual ID -- that is, how long it takes to get from Y to N. And then get summary statistics like the distribution of elapsed times, mean of elapsed times, etc.
So an example output might look like this, recording the three Y to N switches which occurred above (IDs 1, 2, and 3 each switched once):
Y to N change Time elapsed (in seconds)
1 7
2 20
3 54
I'm having a lot of trouble with this for some reason. Right now I have the time in POSIXlt format, and the ID and status as a factor. I have tried using ddply to sort the data by ID and then by timestamp, but this hasn't worked so far. Any advice would be much appreciated!
Edit: changed Time to actually be in the correct type.
Edit 2: I ended up writing a solution while waiting for more answers. My way is much uglier than many of the solutions here, but this is what I did:
# indicator vectors: 1 where the row's Status is N (or Y), 0 otherwise
N <- ifelse(df$Status == "N", 1, 0)
Y <- ifelse(df$Status == "Y", 1, 0)
# making a vector which is 1 for a row if the item status of the row below it is N
var1 <- N
for (i in 1:nrow(df)) {
  var1[i] <- N[i + 1]
}
# making a vector which is TRUE if a row's item status is Y and the row after is N
check <- ifelse(Y == 1 & var1 == 1, TRUE, FALSE)
# had to define the last one as FALSE manually because the for loop above would miss the last entry due to how it was constructed
check[nrow(df)] <- FALSE
# made a loop which finds the time difference for a row's TIME and the row below it, given that "check" is true for that row, and writes that to a results vector.
# here is the results vector
results <- numeric(nrow(df))
# here is the for loop
for (i in 1:nrow(df)) {
  if (check[i]) {
    # later time first, so the elapsed time is positive; units fixed to seconds
    results[i] <- difftime(df$Time[i + 1], df$Time[i], units = "secs")
  }
}
I originally had this solved with a for loop, but over the ~1 million rows of my actual dataset it was way too slow, so I did this vectorization stuff. Would these other solutions work on data that large? I will definitely be trying them out!
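For reference, the same check can be sketched without explicit loops by comparing each row with the row directly below it. This is only a sketch under assumptions: the data frame is rebuilt from the sample rows in the question, the time zone is assumed to be UTC, and `cur`, `nxt`, `is_switch`, and `switches` are illustrative names. Unlike the loops above, it also requires the two rows to share an ID.

```r
# Sample rows copied from the question; tz = "UTC" is an assumption
df <- data.frame(
  ID     = c(1, 2, 3, 4, 1, 2, 5, 6, 7, 1, 3),
  Status = c("Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "N"),
  Time   = as.POSIXct(paste("2013-07-01", c(
    "08:07:00", "08:07:03", "08:07:04", "08:07:06", "08:07:07", "08:07:23",
    "08:07:34", "08:07:45", "08:07:47", "08:07:56", "08:07:58")), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort so that each ID's rows are adjacent and in time order
df <- df[order(df$ID, df$Time), ]

# Index pairs: each row (cur) and the row directly below it (nxt)
cur <- seq_len(nrow(df) - 1)
nxt <- cur + 1

# TRUE where a Y row is followed by an N row belonging to the same ID
is_switch <- df$ID[cur] == df$ID[nxt] &
  df$Status[cur] == "Y" & df$Status[nxt] == "N"

# One row per Y to N switch, with the elapsed time in seconds
switches <- data.frame(
  ID = df$ID[cur][is_switch],
  elapsed_secs = as.numeric(difftime(df$Time[nxt][is_switch],
                                     df$Time[cur][is_switch],
                                     units = "secs"))
)
switches
```

On the sample rows this gives elapsed times of 7, 20, and 54 seconds for IDs 1, 2, and 3, matching the example output in the question. Because every step is a whole-vector operation, it should scale better than a row-by-row loop, though it has not been timed on a million rows here.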
Here is another approach. I tried to keep all of the data in the final output. Please note that, for demonstration purposes, I modified your data a bit. In my code, I first arranged the data by ID and Time. I then recoded Status (Y and N) as 1 and 0 in order to create group. Here, group tells us when Status changed: if the same number continues for a few rows, Status has not changed. I then calculated the time difference (gap) between consecutive rows for each ID. Finally, I set gap to NA for every row that is not the first row of its group; that is, I turned the unnecessary gaps into NAs. Please note that the first observation for each ID has NA in gap as well. gap is in seconds.
ann <- data.frame(ID = c(1,2,3,4,1,2,2,1,1,1,3),
                  Status = c("Y", "Y", "Y", "Y",
                             "N", "N", "Y", "Y", "Y", "N", "N"),
                  Time = c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                           "2013-07-01 08:07:04", "2013-07-01 08:07:06",
                           "2013-07-01 08:07:07", "2013-07-01 08:07:23",
                           "2013-07-01 08:07:34", "2013-07-01 08:07:45",
                           "2013-07-01 08:07:47", "2013-07-01 08:07:56",
                           "2013-07-01 08:07:58"),
                  stringsAsFactors = FALSE)
ann$Time <- as.POSIXct(ann$Time)
# ID Status Time
#1 1 Y 2013-07-01 08:07:00
#2 2 Y 2013-07-01 08:07:03
#3 3 Y 2013-07-01 08:07:04
#4 4 Y 2013-07-01 08:07:06
#5 1 N 2013-07-01 08:07:07
#6 2 N 2013-07-01 08:07:23
#7 2 Y 2013-07-01 08:07:34
#8 1 Y 2013-07-01 08:07:45
#9 1 Y 2013-07-01 08:07:47
#10 1 N 2013-07-01 08:07:56
#11 3 N 2013-07-01 08:07:58
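library(dplyr)
ann %>%
  arrange(ID, Time) %>%
  group_by(ID) %>%
  mutate(Status = ifelse(Status == "Y", 1, 0),          # recode Y as 1, N as 0
         group = cumsum(c(TRUE, diff(Status) != 0)),    # new group whenever Status changes
         gap = Time - lag(Time)) %>%                    # time since the previous row of the same ID
  group_by(ID, group) %>%
  mutate(gap = ifelse(row_number() != 1, NA, gap))      # keep gap only on the first row of each group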
# ID Status Time group gap
#1 1 1 2013-07-01 08:07:00 1 NA
#2 1 0 2013-07-01 08:07:07 2 7
#3 1 1 2013-07-01 08:07:45 3 38
#4 1 1 2013-07-01 08:07:47 3 NA
#5 1 0 2013-07-01 08:07:56 4 9
#6 2 1 2013-07-01 08:07:03 1 NA
#7 2 0 2013-07-01 08:07:23 2 20
#8 2 1 2013-07-01 08:07:34 3 11
#9 3 1 2013-07-01 08:07:04 1 NA
#10 3 0 2013-07-01 08:07:58 2 54
#11 4 1 2013-07-01 08:07:06 1 NA
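The group column above relies on a cumsum() over change flags. A minimal standalone sketch of that idiom, using the recoded Status values of ID 1 from the output above (the vector names `status` and `run_id` are illustrative):

```r
# Recoded Status for ID 1, in time order (1 = Y, 0 = N)
status <- c(1, 0, 1, 1, 0)

# diff() is non-zero wherever the value differs from the previous one;
# the leading TRUE starts the first run, and cumsum() numbers the runs
run_id <- cumsum(c(TRUE, diff(status) != 0))
run_id
```

Consecutive equal values share a run number, so this yields 1 2 3 3 4, the same group values shown for ID 1 in the output.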
This seems to work on the sample data you provided, but those times are not POSIXlt. It finds the first Y time and the first N time for each ID, removes any IDs that don't have a transition from Y to N, and subtracts the first Y time from the first N time.
library('dplyr')
df <- read.table(text = "ID Status Time
1 Y 1
2 Y 2
3 Y 3.5
4 Y 4
1 N 5.8
2 N 6
5 Y 7
6 Y 8
7 Y 8.1
1 Y 11
3 N 12", header = TRUE)
df$ID <- as.factor(df$ID) # convert ID to factor
df %>%
  group_by(ID, Status) %>%
  summarize(Time = min(Time)) %>%
  filter("N" %in% Status & "Y" %in% Status) %>%
  summarize(Time_elapsed = Time[Status == "N"] - Time[Status == "Y"])
Result:
ID Time_elapsed
1 1 4.8
2 2 4.0
3 3 8.5
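The question also asks for summary statistics. As a small sketch, assuming the elapsed times have been pulled into a plain numeric vector (here the three values from the result above, stored under the illustrative name `elapsed`), base R's summary functions can be applied directly:

```r
# Elapsed times copied from the Time_elapsed column of the result above
elapsed <- c(4.8, 4.0, 8.5)

mean(elapsed)      # average elapsed time
median(elapsed)    # middle value
summary(elapsed)   # min, quartiles, mean, and max in one call
hist(elapsed)      # quick look at the distribution
```

With real data, the same calls would work on the Time_elapsed column of the summarized data frame.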