This is my df (a data.frame):

    group value
    1     10
    1     20
    1     25
    2     5
    2     10
    2     15
I need to calculate the difference between values in consecutive rows, within each group.
So I need this result:

    group value diff
    1     10    NA    # because there is no previous value
    1     20    10    # value[2] - value[1]
    1     25    5     # value[3] - value[2]
    2     5     NA    # because the group has changed
    2     10    5     # value[5] - value[4]
    2     15    5     # value[6] - value[5]

I can solve this problem with ddply, but it takes too much time because my df has a lot of groups (over 1,000,000).
Is there a more efficient approach to this problem?
The diff() function in base R computes the differences between consecutive elements of a vector, such as a column of a data frame. It returns a vector whose length is the length of the input column minus 1.
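Because diff() returns one element fewer than its input, the result has to be padded before it can be stored next to the original column. A minimal sketch in base R, using the first group's values from the question (the names `value`, `d` and `padded` are only illustrative):

```r
# Values of the first group from the question
value <- c(10, 20, 25)

# diff() yields one difference per consecutive pair: 10 and 5
d <- diff(value)

# Pad with a leading NA so the result lines up with the original rows
padded <- c(NA, d)
```

The leading NA marks the first row, which has no previous value to subtract.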
Here's the SQL query to compare each row with the previous row. In the above query, we join the sales table with itself using an INNER JOIN with the condition g2.id = g1.id + 1, which lets you compare each row with its previous row. Note that this condition relies on the id column holding consecutive numbers.
The difference is calculated by taking each row's value in the specified column and subtracting the previous row's value, which is obtained with the shift() function.
The package data.table can do this fairly quickly, using the shift function.

    library(data.table)
    df <- data.table(group = rep(c(1, 2), each = 3), value = c(10, 20, 25, 5, 10, 15))
    # setDT(df)  # use this instead if df is already a data frame
    df[, diff := value - shift(value), by = group]
    #    group value diff
    # 1:     1    10   NA
    # 2:     1    20   10
    # 3:     1    25    5
    # 4:     2     5   NA
    # 5:     2    10    5
    # 6:     2    15    5
    setDF(df)  # if you want to convert back to old data.frame syntax

Or using the lag function in dplyr:

    library(dplyr)
    df %>%
      group_by(group) %>%
      mutate(Diff = value - lag(value))
    #   group value  Diff
    #   <int> <int> <int>
    # 1     1    10    NA
    # 2     1    20    10
    # 3     1    25     5
    # 4     2     5    NA
    # 5     2    10     5
    # 6     2    15     5

For alternatives that predate data.table::shift and dplyr::lag, see the edits to this answer.
You can use the base function ave() for this:

    df <- data.frame(group = rep(c(1, 2), each = 3), value = c(10, 20, 25, 5, 10, 15))
    df$diff <- ave(df$value, factor(df$group), FUN = function(x) c(NA, diff(x)))

which returns

      group value diff
    1     1    10   NA
    2     1    20   10
    3     1    25    5
    4     2     5   NA
    5     2    10    5
    6     2    15    5
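With a very large number of groups, as in the question, calling a function once per group can itself become the bottleneck. The sketch below is an alternative not taken from the answers above: it computes all differences in a single diff() call and then blanks out the first row of each group. It assumes the rows of each group are contiguous (for example, df is sorted by group) and that df has at least one row; `d` and `first_in_group` are illustrative names.

```r
df <- data.frame(group = rep(c(1, 2), each = 3),
                 value = c(10, 20, 25, 5, 10, 15))

# Differences between every pair of consecutive rows, ignoring groups for now
d <- c(NA, diff(df$value))

# TRUE where a row starts a new group (the first row always does)
first_in_group <- c(TRUE, df$group[-1] != df$group[-nrow(df)])

# A group's first row has no previous value within that group
d[first_in_group] <- NA

df$diff <- d
```

Whether this is actually faster than the grouped approaches on a given data set is something to measure rather than assume.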