
dplyr: lead() and lag() wrong when used with group_by()

Tags: r, dplyr

I want to find the lead() and lag() values within each group, but I get some wrong results.

For example, data is like this:

library(dplyr)
df = data.frame(name=rep(c('Al','Jen'),3),
                score=rep(c(100, 80, 60),2))
df

Data:

  name score
1   Al   100
2  Jen    80
3   Al    60
4  Jen   100
5   Al    80
6  Jen    60

Now I try to find the lead() and lag() scores for each person. If I sort the data with arrange() first, I get the correct answer:

df %>%
  arrange(name) %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score) )

OUTPUT1:

Source: local data frame [6 x 4]
Groups: name

  name score next.score before.score
1   Al   100         60           NA
2   Al    60         80          100
3   Al    80         NA           60
4  Jen    80        100           NA
5  Jen   100         60           80
6  Jen    60         NA          100

Without arrange(), the result is wrong:

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score) )

OUTPUT2:

Source: local data frame [6 x 4]
Groups: name

  name score next.score before.score
1   Al   100         80           NA
2  Jen    80         60           NA
3   Al    60        100           80
4  Jen   100         80           60
5   Al    80         NA          100
6  Jen    60         NA           80

E.g., in the 1st line, Al's next.score should be 60 (Al's score in the 3rd line), not 80.

Does anybody know why this happens? Why does arrange() affect the result (the values themselves, not just the row order)? Thanks~

YJZ asked Jan 30 '15 11:01



2 Answers

It seems you have to pass an additional argument to the lag() and lead() functions. When I run your code without arrange(), but with order_by added, the result looks correct.

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score, order_by=name),
         before.score = lag(score, order_by=name))

Output:

  name score next.score before.score
1   Al   100         60           NA
2  Jen    80        100           NA
3   Al    60         80          100
4  Jen   100         60           80
5   Al    80         NA           60
6  Jen    60         NA          100

My sessionInfo():

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250    LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5    parallel_3.1.1  Rcpp_0.11.5
[7] tools_3.1.1
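
The expected per-person values can also be cross-checked without dplyr at all. The following is only a sketch using base R's ave(); the helper names shift_lead and shift_lag are made up for illustration and are not part of any package.

```r
# Same data as in the question
df <- data.frame(name  = rep(c("Al", "Jen"), 3),
                 score = rep(c(100, 80, 60), 2))

# Hypothetical helpers: shift a vector by one position, padding with NA
shift_lead <- function(x) c(x[-1], NA)           # next value
shift_lag  <- function(x) c(NA, x[-length(x)])   # previous value

# ave() applies FUN within each level of df$name and keeps the original row order
df$next.score   <- ave(df$score, df$name, FUN = shift_lead)
df$before.score <- ave(df$score, df$name, FUN = shift_lag)

df
```

If the grouped dplyr result disagrees with these columns, the grouping is not being respected by lead() and lag().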
Tomasz Sosiński answered Oct 10 '22 22:10


It may be that stats::lag is being used instead (e.g. when restoring environments with the session package). This can easily slip through unnoticed, because it won't throw an error when used as in the question. Double-check by simply typing lag at the console, use the conflicted package, or disambiguate the function call by writing dplyr::lag instead.

The same can happen with plyr::mutate if you have loaded the plyr package in your session, so make sure you are also calling dplyr::mutate.
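
A minimal sketch of both checks, assuming a reasonably recent dplyr is installed: it first inspects which package the unqualified lag resolves to, then runs the pipeline from the question with every verb namespace-qualified so that masking cannot change the result. The variable names lag_source and res are just for illustration.

```r
library(dplyr)

df <- data.frame(name  = rep(c("Al", "Jen"), 3),
                 score = rep(c(100, 80, 60), 2))

# Which package does the bare name `lag` come from in this session?
# "dplyr" is what we want here; "stats" would mean stats::lag is being picked up.
lag_source <- environmentName(environment(lag))
print(lag_source)

# Fully qualified calls: not affected by plyr or stats masking
res <- df %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(next.score   = dplyr::lead(score),
                before.score = dplyr::lag(score)) %>%
  dplyr::ungroup()

print(res)
```

Qualifying every call is verbose, so it is probably most useful as a diagnostic: if the qualified version gives different values than the unqualified one, a masked function is the likely cause.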

Holger Brandl answered Oct 10 '22 22:10