How to calculate time difference with previous row of a data.frame by group


I have a data frame with a sorted POSIXct variable. Each row belongs to a category, and for each category level I want the time difference between each row and the previous one, stored in a new variable. A reproducible example follows. The function below only creates sample data with random times for the purpose of this question.

random.time <- function(N, start, end) {
  st <- as.POSIXct(start)
  en <- as.POSIXct(end)
  dt <- as.numeric(difftime(en, st, units = "secs"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
  return(rt)
}

The code for simulating the problem is as below:

set.seed(123)
category <- sample(LETTERS[1:5], 20, replace=TRUE)
randtime <- random.time(20, '2015/06/01 08:00:00', '2015/06/01 18:00:00')
df <- data.frame(category, randtime)

The expected resulting data frame is as below:

category  randtime             timediff (secs)
A         2015-06-01 09:05:00  0
A         2015-06-01 09:06:30  90
A         2015-06-01 09:10:00  210
B         2015-06-01 10:18:58  0
B         2015-06-01 10:19:58  60
C         2015-06-01 08:14:00  0
C         2015-06-01 08:16:30  150

The first row of each subgroup has a timediff of 0 because there is no previous row. I was able to group by category and call the following function to calculate the differences, but could not get it to collate the final output for all category groups.

getTimeDiff <- function(x) {
  no_rows <- nrow(x)
  if(no_rows > 1) {
    for(i in 2:no_rows) {
      t <- x[i, "randtime"] - x[i-1, "randtime"]
    }
  }
}
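For reference, a minimal base R sketch of how a per-group function like this could return its results so they can be collated. The helper name add_time_diff and the small sample data frame are hypothetical and only stand in for the df built above; the approach (split, apply, rbind) is one possible way to do it, not necessarily the best one.

```r
# Hypothetical helper: takes the rows of one category and returns them
# with an added timediff column (seconds since the previous row).
add_time_diff <- function(x) {
  x <- x[order(x$randtime), ]
  # as.numeric() on POSIXct gives seconds since the epoch, so diff()
  # yields gaps in seconds; the first row of the group gets 0.
  x$timediff <- c(0, diff(as.numeric(x$randtime)))
  x
}

# Small illustrative data set (assumed values, same shape as df above)
df <- data.frame(
  category = c("A", "A", "A", "B", "B", "C", "C"),
  randtime = as.POSIXct(c("2015-06-01 09:05:00", "2015-06-01 09:06:30",
                          "2015-06-01 09:10:00", "2015-06-01 10:18:58",
                          "2015-06-01 10:19:58", "2015-06-01 08:14:00",
                          "2015-06-01 08:16:30"), tz = "UTC")
)

# Split by category, process each piece, then bind the pieces back together.
# drop = TRUE skips unused factor levels, which would produce empty pieces.
pieces <- lapply(split(df, df$category, drop = TRUE), add_time_diff)
result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

The key difference from the loop version is that the helper returns a data frame for each group instead of overwriting a local variable, which gives do.call(rbind, ...) something to collate.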

I have been at this for two days now without luck so would greatly appreciate any help. Thanks.


asked Oct 07 '15 18:10

Mntester

1 Answer

Try this:

library(dplyr)
df %>%
  arrange(category, randtime) %>%
  group_by(category) %>%
  mutate(diff = randtime - lag(randtime),
         diff_secs = as.numeric(diff, units = 'secs'))

#   category            randtime             diff   diff_secs
#     (fctr)              (time)           (dfft)       (dbl)
# 1        A 2015-06-01 11:10:54         NA hours          NA
# 2        A 2015-06-01 15:35:04   4.402785 hours   15850.027
# 3        A 2015-06-01 17:01:22   1.438395 hours    5178.222
# 4        B 2015-06-01 08:14:46         NA hours          NA
# 5        B 2015-06-01 16:53:43 518.955379 hours 1868239.364
# 6        B 2015-06-01 17:37:48  44.090950 hours  158727.420

You may also want to add replace(is.na(.), 0) to the chain.
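As a rough sketch of the zero-fill step, here is one way to get 0 in the first row of each group. It swaps in dplyr's coalesce() on the numeric column instead of replace(is.na(.), 0), and it calls difftime() with units = "secs" so the result does not depend on automatically chosen units. The sample data frame is made up for illustration.

```r
library(dplyr)

# Illustrative data (assumed values), deliberately out of order
df <- data.frame(
  category = c("B", "A", "A", "B", "A"),
  randtime = as.POSIXct(c("2015-06-01 10:19:58", "2015-06-01 09:06:30",
                          "2015-06-01 09:05:00", "2015-06-01 10:18:58",
                          "2015-06-01 09:10:00"), tz = "UTC")
)

result <- df %>%
  arrange(category, randtime) %>%
  group_by(category) %>%
  mutate(
    # Gap to the previous row within the group, always in seconds
    diff_secs = as.numeric(difftime(randtime, lag(randtime), units = "secs")),
    # The first row of each group has no previous row: turn its NA into 0
    diff_secs = coalesce(diff_secs, 0)
  ) %>%
  ungroup()

# result$diff_secs is 0, 90, 210, 0, 60
```

Filling only the diff_secs column leaves the other columns untouched, which avoids writing 0 into a date-time column by accident.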


answered Oct 09 '22 01:10

JasonAizkalns
