Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lagging variable by group does not work in dplyr

Tags:

r

dplyr

lag

I'm desperately trying to lag a variable by group. I found this post that deals with essentially the same problem I'm facing, but the solution does not work for me, no idea why.

This is my problem:

library(dplyr)

df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df %>%
       arrange(monthvec) %>%
       mutate(growth=ifelse(monthvec==1, 0.3,
                   ifelse(monthvec==2, 0.5,
                          ifelse(monthvec==3, 0.7,
                                 ifelse(monthvec==4, 0.1,
                                        ifelse(monthvec==5, 0.6,NA))))))

df%>%
   group_by(monthvec) %>%
   mutate(lag.growth = lag(growth, order_by=monthvec))

Source: local data frame [13 x 3]
Groups: monthvec [5]

monthvec growth lag.growth
  <int>  <dbl>      <dbl>
1         1    0.3         NA
2         1    0.3        0.3
3         2    0.5         NA
4         2    0.5        0.5
5         3    0.7         NA
6         3    0.7        0.7
7         3    0.7        0.7
8         4    0.1         NA
9         4    0.1        0.1
10        4    0.1        0.1
11        5    0.6         NA
12        5    0.6        0.6
13        5    0.6        0.6

This is what I'd like it to be in the end:

df$lag.growth <- c(NA, NA, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7,0.7,0.7, 0.1,0.1,0.1)

   monthvec growth lag.growth
1         1    0.3         NA
2         1    0.3         NA
3         2    0.5        0.3
4         2    0.5        0.3
5         3    0.7        0.5
6         3    0.7        0.5
7         3    0.7        0.5
8         4    0.1        0.7
9         4    0.1        0.7
10        4    0.1        0.7 
11        5    0.6        0.1
12        5    0.6        0.1
13        5    0.6        0.1

I believe that one problem is that my groups are not of equal length...

Thanks for helping out.

like image 438
Thomas Avatar asked Sep 21 '25 10:09

Thomas


2 Answers

Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.

library(dplyr)

df %>% 
 group_by(monthvec) %>% 
 mutate(cnt = n()) %>% 
 ungroup() %>% 
 mutate(lag.growth = lag(growth, first(cnt))) %>% 
 group_by(monthvec) %>% 
 mutate(lag.growth = first(lag.growth)) %>% 
 select(-cnt)

which gives,

# A tibble: 13 x 3
# Groups:   monthvec [5]
   monthvec growth lag.growth
      <int>  <dbl>      <dbl>
 1        1    0.3         NA
 2        1    0.3         NA
 3        2    0.5        0.3
 4        2    0.5        0.3
 5        3    0.7        0.5
 6        3    0.7        0.5
 7        3    0.7        0.5
 8        4    0.1        0.7
 9        4    0.1        0.7
10        4    0.1        0.7
11        5    0.6        0.1
12        5    0.6        0.1
13        5    0.6        0.1
like image 130
Sotos Avatar answered Sep 23 '25 10:09

Sotos


You may join your original data with a dataframe with a shifted "monthvec".

left_join(df, df %>% mutate(monthvec = monthvec + 1) %>% unique(), by = "monthvec")

#    monthvec growth.x growth.y
# 1         1      0.3       NA
# 2         1      0.3       NA
# 3         2      0.5      0.3
# 4         2      0.5      0.3
# 5         3      0.7      0.5
# 6         3      0.7      0.5
# 7         3      0.7      0.5
# 8         4      0.1      0.7
# 9         4      0.1      0.7
# 10        4      0.1      0.7
# 11        5      0.6      0.1
# 12        5      0.6      0.1
# 13        5      0.6      0.1
like image 33
Henrik Avatar answered Sep 23 '25 09:09

Henrik