I have a data frame with a sorted POSIXct variable in it. Each row has a category, and for each category level I want the time difference between consecutive rows, stored in a new variable. A reproducible example follows. The function below only creates sample data with random times for the purpose of this question.
    random.time <- function(N, start, end) {
      st <- as.POSIXct(start)
      en <- as.POSIXct(end)
      dt <- as.numeric(difftime(en, st, units = "secs"))
      ev <- sort(runif(N, 0, dt))
      rt <- st + ev
      return(rt)
    }
The code for simulating the problem is as below:
    set.seed(123)
    category <- sample(LETTERS[1:5], 20, replace = TRUE)
    randtime <- random.time(20, '2015/06/01 08:00:00', '2015/06/01 18:00:00')
    df <- data.frame(category, randtime)
The expected resulting data frame is as below:
    category  randtime             timediff (secs)
    A         2015-06-01 09:05:00  0
    A         2015-06-01 09:06:30  90
    A         2015-06-01 09:10:00  210
    B         2015-06-01 10:18:58  0
    B         2015-06-01 10:19:58  60
    C         2015-06-01 08:14:00  0
    C         2015-06-01 08:16:30  150
Each subgroup in the output will have the first row with timediff value of 0 as there is no previous row. I was able to group by category and call the following function to calculate the differences but could not get it to collate the final output for all category groups.
    getTimeDiff <- function(x) {
      no_rows <- nrow(x)
      if (no_rows > 1) {
        for (i in 2:no_rows) {
          t <- x[i, "randtime"] - x[i-1, "randtime"]
        }
      }
    }
I have been at this for two days now without luck so would greatly appreciate any help. Thanks.
Grouped indexing on a data frame can be used to calculate the difference between rows by group in R. A 'by' argument (for example, the by argument in data.table's DT[i, j, by] syntax) names the column to group the data by. All rows are retained, and a new column is added that holds, for each row, the difference from the previous row in the same group.
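The idea described above (keep every row, add one column holding the within-group difference) can be sketched in base R without any grouping package. This is a swapped-in alternative rather than the 'by' syntax itself: it uses ave(), and the column name timediff is only an assumption taken from the expected output in the question.

```r
# Rebuild the sample data from the question (same helper, same seed).
random.time <- function(N, start, end) {
  st <- as.POSIXct(start)
  en <- as.POSIXct(end)
  dt <- as.numeric(difftime(en, st, units = "secs"))
  st + sort(runif(N, 0, dt))
}

set.seed(123)
category <- sample(LETTERS[1:5], 20, replace = TRUE)
randtime <- random.time(20, "2015/06/01 08:00:00", "2015/06/01 18:00:00")
df <- data.frame(category, randtime)

# Order by group, then by time, so diff() compares neighbouring rows.
df <- df[order(df$category, df$randtime), ]

# ave() applies FUN within each category and returns a vector aligned with
# the rows of df. as.numeric() on POSIXct gives seconds since the epoch, so
# the differences are in seconds. The first row of each group gets 0.
df$timediff <- ave(as.numeric(df$randtime), df$category,
                   FUN = function(x) c(0, diff(x)))
```

Because the result is a plain numeric vector in seconds, there is no difftime unit to keep track of afterwards.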
In pandas, the DataFrame.diff() function computes the difference between a row and an earlier row of a DataFrame. Its periods parameter sets how many rows back to look when calculating the difference (default 1).
Try this:
    library(dplyr)

    df %>%
      arrange(category, randtime) %>%
      group_by(category) %>%
      mutate(diff = randtime - lag(randtime),
             diff_secs = as.numeric(diff, units = 'secs'))

    #   category            randtime             diff   diff_secs
    #     (fctr)              (time)           (dfft)       (dbl)
    # 1        A 2015-06-01 11:10:54         NA hours          NA
    # 2        A 2015-06-01 15:35:04   4.402785 hours   15850.027
    # 3        A 2015-06-01 17:01:22   1.438395 hours    5178.222
    # 4        B 2015-06-01 08:14:46         NA hours          NA
    # 5        B 2015-06-01 16:53:43 518.955379 hours 1868239.364
    # 6        B 2015-06-01 17:37:48  44.090950 hours  158727.420
You may also want to add replace(is.na(.), 0) to the chain, so that the first row of each group shows 0 instead of NA.
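A variation on the chain above, as a minimal sketch: it asks difftime() for seconds explicitly instead of letting the subtraction choose units, and fills the leading NA of each group with 0 via coalesce(). The reason for the explicit units is that the B rows in the printed output above appear to be minutes carrying an "hours" label (08:14:46 to 16:53:43 is about 519 minutes, not 519 hours), which inflates diff_secs. The small data frame and the column name timediff are assumptions, taken from the expected output in the question.

```r
library(dplyr)

# Hand-made sample in shuffled order; timestamps come from the question's
# expected output. The time zone is fixed only to make the example stable.
df <- data.frame(
  category = c("A", "B", "A", "C", "A", "B", "C"),
  randtime = as.POSIXct(
    c("2015-06-01 09:05:00", "2015-06-01 10:18:58", "2015-06-01 09:06:30",
      "2015-06-01 08:14:00", "2015-06-01 09:10:00", "2015-06-01 10:19:58",
      "2015-06-01 08:16:30"),
    tz = "UTC"
  )
)

result <- df %>%
  arrange(category, randtime) %>%
  group_by(category) %>%
  mutate(
    # Explicit units, so every group is measured in seconds.
    timediff = as.numeric(difftime(randtime, lag(randtime), units = "secs")),
    # The first row of each group has no previous row; report 0 there.
    timediff = coalesce(timediff, 0)
  ) %>%
  ungroup()
```

With this input, result$timediff should come out as 0, 90, 210 for A, 0, 60 for B and 0, 150 for C, matching the table in the question.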