Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Index of next occurring record

I have a sample dataset of the trajectory of one bike. My objective is to figure out, on average, the amount of time that lapses in between visits to station B.

So far, I have been able to simply order the dataset with:

test[order(test$starttime, decreasing = FALSE),]

and find the row index of where start_station and end_station equal B.

 which(test$start_station == 'B')
 which(test$end_station == 'B')

The next part is where I run into trouble. In order to calculate the time that lapses in between when the bike is at Station B, we must take the difftime() between where start_station = "B" (bike leaves) and the next occurring record where end_station= "B", even if the record happens to be in the same row (see row 6).

Using the dataset below, we know that the bike spent 510 minutes between 7:30:00 and 16:00:00 outside of Station B, 30 minutes between 18:00:00 and 18:30:00 outside of Station B, and 210 minutes between 19:00:00 and 22:30:00 outside of Station B, which averages to 250 minutes.

How would one reproduce this output in R using difftime()?

> test
   bikeid start_station           starttime end_station             endtime
1       1             A 2017-09-25 01:00:00           B 2017-09-25 01:30:00
2       1             B 2017-09-25 07:30:00           C 2017-09-25 08:00:00
3       1             C 2017-09-25 10:00:00           A 2017-09-25 10:30:00
4       1             A 2017-09-25 13:00:00           C 2017-09-25 13:30:00
5       1             C 2017-09-25 15:30:00           B 2017-09-25 16:00:00
6       1             B 2017-09-25 18:00:00           B 2017-09-25 18:30:00
7       1             B 2017-09-25 19:00:00           A 2017-09-25 19:30:00
8       1             А 2017-09-25 20:00:00           C 2017-09-25 20:30:00
9       1             C 2017-09-25 22:00:00           B 2017-09-25 22:30:00
10      1             B 2017-09-25 23:00:00           C 2017-09-25 23:30:00

Here is the sample data:

> dput(test)
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A", 
"B", "C", "A", "C", "B", "B", "А", "C", "B"), starttime = structure(c(1506315600, 
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400, 
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt"
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B", 
"A", "C", "B", "C"), endtime = structure(c(1506317400, 1506340800, 
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800, 
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid", 
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA, 
-10L), class = "data.frame")
like image 835
the_darkside Avatar asked Oct 29 '22 01:10

the_darkside


1 Answers

This will calculate the difference as asked in the order it occurs, but does not append it to the data.frame

lapply(df1$starttime[df1$start_station == "B"], function(x, et) difftime(et[x < et][1], x, units = "mins"), et = df1$endtime[df1$end_station == "B"])

[[1]]
Time difference of 510 mins

[[2]]
Time difference of 30 mins

[[3]]
Time difference of 210 mins

[[4]]
Time difference of NA mins

To calculate the average time:

v1 <- sapply(df1$starttime[df1$start_station == "B"], function(x, et) difftime(et[x < et][1], x, units = "mins"), et = df1$endtime[df1$end_station == "B"])
mean(v1, na.rm = TRUE)

[1] 250
like image 180
manotheshark Avatar answered Nov 09 '22 12:11

manotheshark