diff operation within a group, after a dplyr::group_by()

Let's say I have this data.frame (with 3 variables):

ID  Period  Score
123 2013    146
123 2014    133
23  2013    150
456 2013    205
456 2014    219
456 2015    140
78  2012    192
78  2013    199
78  2014    133
78  2015    170
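
For anyone who wants to reproduce the question, the table above can be rebuilt roughly as follows. This is a minimal sketch: the name `data` matches the pipeline used in the question, and all three columns are assumed to be numeric, which the question does not state.

```r
# Rebuild the example table as a data.frame named `data`.
# Assumption: ID, Period and Score are all stored as numeric values.
data <- data.frame(
  ID     = c(123, 123, 23, 456, 456, 456, 78, 78, 78, 78),
  Period = c(2013, 2014, 2013, 2013, 2014, 2015, 2012, 2013, 2014, 2015),
  Score  = c(146, 133, 150, 205, 219, 140, 192, 199, 133, 170)
)

# Quick look at the structure
str(data)
```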

Using dplyr, I can group the rows by ID and keep only the IDs that appear more than once:

data <- data %>% group_by(ID) %>% filter(n() > 1)

Now, what I would like to achieve is to add a column, computed within each ID, defined as: Difference = Score of Period P - Score of Period P-1, to get something like this:

ID  Period  Score  Difference
123 2013    146
123 2014    133    -13
456 2013    205
456 2014    219    14
456 2015    140    -79
78  2012    192
78  2013    199    7
78  2014    133    -66
78  2015    170    37

It is rather trivial to do this in a spreadsheet, but I have no idea how to achieve it in R.
Thanks for any help or guidance.

Franky asked Jan 20 '15 12:01


1 Answer

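Here is another solution using lag. Depending on the use case, it might be more convenient than diff, because the NAs clearly show that a particular value did not have a predecessor. With a diff-based result in which the first row of each group is filled with 0, a 0 could mean either a) a missing predecessor or b) a genuine difference of zero between two periods. Note that lag works on row order, so the rows need to be sorted by Period within each ID (as they are in the example data).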

data %>% group_by(ID) %>% filter(n() > 1) %>%
  mutate(
    Difference = Score - lag(Score)
    )

#   ID Period Score Difference
# 1 123   2013   146         NA
# 2 123   2014   133        -13
# 3 456   2013   205         NA
# 4 456   2014   219         14
# 5 456   2015   140        -79
# 6  78   2012   192         NA
# 7  78   2013   199          7
# 8  78   2014   133        -66
# 9  78   2015   170         37
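
To make the ambiguity concrete, here is a small sketch comparing the two approaches side by side. The data frame `toy` and the column names `diff_lag` and `diff_zero` are made up for illustration and are not part of the question. The rows are deliberately unsorted, so the sketch also sorts by Period within each ID before taking differences, assuming a dplyr version whose `arrange()` supports `.by_group`.

```r
library(dplyr)

# Hypothetical toy data: ID "a" has two consecutive periods with the same
# score, and its rows are deliberately not sorted by Period.
toy <- data.frame(
  ID     = c("a", "a", "a", "b", "b"),
  Period = c(2015, 2013, 2014, 2013, 2014),
  Score  = c(120, 100, 100, 50, 70)
)

result <- toy %>%
  group_by(ID) %>%
  # Sort within each ID so that lag() really refers to the previous period
  arrange(Period, .by_group = TRUE) %>%
  mutate(
    diff_lag  = Score - lag(Score),  # NA marks "no predecessor"
    diff_zero = c(0, diff(Score))    # 0 marks "no predecessor" OR "no change"
  ) %>%
  ungroup()

print(result)
```

In this sketch the first row of each ID gets NA in `diff_lag` but 0 in `diff_zero`, while the unchanged score for ID "a" in 2014 gives 0 in both columns, so only `diff_lag` lets you tell the two cases apart. If sorting the whole table is not desired, `lag(Score, order_by = Period)` is an alternative worth checking.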

alex23lemm answered Sep 18 '22 12:09