Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group_by without creating a grouping variable?

I need to perform a basic group_by / mutate operation using an auxiliary grouping variable. For instance:

df <- data.frame(
  u = c(0, 0, 1, 0, 1),
  v = c(8, 4, 2, 3, 5)
)

df %>%
  group_by(tmp = cumsum(u)) %>%
  mutate(w = cumprod(v)) %>%
  ungroup %>%
  select(-tmp)

My problem is that if df happens to already contain a column named tmp I will lose it.

Of course I could choose a very exotic name instead of tmp to reduce the likeliness of a collision (or I could even choose something like strrep("z", max(nchar(names(df))) + 1) to be sure) but I'd prefer to have a cleaner solution.

In other words, I'm looking for the dplyr equivalent of this data.table line:

setDT(df)[, w := cumprod(v), by = cumsum(u)]
like image 309
Scarabee Avatar asked Jan 29 '23 11:01

Scarabee


2 Answers

We could create a function to take care of this. Assuming that the temporary grouping variable to be created is 'tmp', by concatenating with the column names of the dataset and calling make.unique, if there is already a 'tmp' column in the dataset, the duplicate one will be renamed as 'tmp.1'. Using the !!, naming the column with 'tmp.1' (from nm1) will not affect the 'tmp' already present in the dataset. In case, if there is no 'tmp', column, the grouping column will be named as 'tmp' and later removed with select

f1 <- function(dat, grpCol, Col) {
  grpCol <- enquo(grpCol)
  Col <- enquo(Col)

 changeCol <- "tmp"
 nm1 <-  tail(make.unique(c(names(dat), changeCol)), 1)
 dat %>%
    group_by(!! (nm1) := cumsum(!! grpCol)) %>%
    mutate(w = cumprod(!!Col)) %>%
     ungroup %>%
     select(-one_of(nm1)) 


}

-run the function

f1(df, u, v)
# A tibble: 5 x 3
#      u     v     w
#  <dbl> <dbl> <dbl>
#1  0     8.00  8.00
#2  0     4.00 32.0 
#3  1.00  2.00  2.00
#4  0     3.00  6.00
#5  1.00  5.00  5.00


 f1(df %>% mutate(tmp = 1), u, v) #create a 'tmp' column in dataset
# A tibble: 5 x 4
#      u     v   tmp     w
#  <dbl> <dbl> <dbl> <dbl>
#1  0     8.00  1.00  8.00
#2  0     4.00  1.00 32.0 
#3  1.00  2.00  1.00  2.00
#4  0     3.00  1.00  6.00
#5  1.00  5.00  1.00  5.00

As a followup (comments from @Frank) about passing expressions

expr <- quos(tmp = cumsum(u), w = cumprod(v))
#additional checks outside the function
names(expr)[1] <- if(names(expr)[1] %in% names(df)) 
             strrep(names(expr)[1], 2) else names(expr)[1]


f2 <- function(dat, exprs ){

dat %>%
    group_by(!!! exprs[1]) %>%
    mutate(!!! exprs[2])

}

f2(df, expr)
# A tibble: 5 x 4
# Groups: tmp [3]
#      u     v   tmp     w
#  <dbl> <dbl> <dbl> <dbl> 
#1  0     8.00  0     8.00
#2  0     4.00  0    32.0 
#3  1.00  2.00  1.00  2.00
#4  0     3.00  1.00  6.00
#5  1.00  5.00  2.00  5.00
like image 191
akrun Avatar answered Jan 31 '23 07:01

akrun


You could use ave instead:

df %>% mutate(w = ave(v, cumsum(u), FUN = cumprod))

by would also work:

df %>% 
   by(cumsum(.$u), mutate, w = cumprod(v)) %>% 
   unclass %>% 
   bind_rows
like image 21
G. Grothendieck Avatar answered Jan 31 '23 09:01

G. Grothendieck