
Calculate difference between values in consecutive rows by group

Tags:

r

This is my df (a data.frame):

group value
1     10
1     20
1     25
2     5
2     10
2     15

I need to calculate the difference between values in consecutive rows, by group.

So I need this result:

group value diff
1     10    NA  # because there is no previous value
1     20    10  # value[2] - value[1]
1     25    5   # value[3] - value[2]
2     5     NA  # because the group changed
2     10    5   # value[5] - value[4]
2     15    5   # value[6] - value[5]

I can handle this with ddply, but it takes too much time because my df has a lot of groups (over 1,000,000).

Is there a more efficient approach to this problem?

kmangyo asked Feb 13 '13 04:02




2 Answers

The package data.table can do this fairly quickly, using the shift function.

require(data.table)
df <- data.table(group = rep(c(1, 2), each = 3), value = c(10, 20, 25, 5, 10, 15))
# setDT(df)  # use this instead if df is already a data.frame

df[, diff := value - shift(value), by = group]
#    group value diff
# 1:     1    10   NA
# 2:     1    20   10
# 3:     1    25    5
# 4:     2     5   NA
# 5:     2    10    5
# 6:     2    15    5

setDF(df)  # if you want to convert back to a plain data.frame

Or, using the lag function in dplyr:

library(dplyr)

df %>%
    group_by(group) %>%
    mutate(Diff = value - lag(value))
#   group value  Diff
#   <int> <int> <int>
# 1     1    10    NA
# 2     1    20    10
# 3     1    25     5
# 4     2     5    NA
# 5     2    10     5
# 6     2    15     5

For alternatives that predate data.table::shift and dplyr::lag, see this answer's edit history.
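As a rough idea of what an approach without shift might look like, here is a minimal sketch that builds the lagged difference from base diff() inside a grouped data.table assignment. This is an assumption for illustration, not necessarily what the earlier revisions contained; it only relies on data.table's `:=` with `by` and on base R's diff().

```r
library(data.table)

# Example data from the question
df <- data.table(group = rep(c(1, 2), each = 3),
                 value = c(10, 20, 25, 5, 10, 15))

# Within each group, diff(value) has one element fewer than value,
# so pad the front with NA for the first row of the group.
df[, diff := c(NA, diff(value)), by = group]

print(df)
```

The padding with NA plays the same role as shift(value) returning NA for the first row of each group.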

Blue Magister answered Oct 17 '22 12:10


You can use the base R function ave() for this:

df <- data.frame(group = rep(c(1, 2), each = 3), value = c(10, 20, 25, 5, 10, 15))
df$diff <- ave(df$value, factor(df$group), FUN = function(x) c(NA, diff(x)))

which returns

  group value diff
1     1    10   NA
2     1    20   10
3     1    25    5
4     2     5   NA
5     2    10    5
6     2    15    5
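With a very large number of groups, calling a function once per group can itself become the bottleneck. A possible fully vectorized variant, sketched below, takes one diff() over the whole column and then blanks out the first row of every group. It assumes that the rows are already sorted so each group's rows are contiguous and that group contains no missing values; the helper names d and is_first are made up for this illustration.

```r
# Example data from the question; each group's rows are assumed contiguous.
df <- data.frame(group = rep(c(1, 2), each = 3),
                 value = c(10, 20, 25, 5, 10, 15))

# Difference from the previous row, ignoring groups for now.
d <- c(NA, diff(df$value))

# TRUE on the first row of each group: row 1, plus every row whose
# group differs from the group of the row above it.
is_first <- c(TRUE, df$group[-1] != df$group[-nrow(df)])

# A group's first row has no previous value within the group.
d[is_first] <- NA
df$diff <- d

print(df)
```

If the data is not sorted by group, it would need to be ordered first (for example with order(df$group), keeping the original row order within each group).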
MrFlick answered Oct 17 '22 12:10