Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using mutate_at from dplyr

Tags:

r

dplyr

I have a data frame with 5 columns and I want to produce 4 additional columns giving my the difference between the last 4 columns and the first column. I tried the following, but that doesn't work:

library(tidyverse)
df <- as.tibble(data.frame(A = c(1,2), B = c(3,4), C = c(4,5), D = c(2,3), E = c(4,5)))
r_diff <- function(x,y){
  z = y - x
  return(z)
}
vars_to_process <- c("B","C","D","E")
df %>% mutate_at(.cols=vars_to_process, .funs =r_diff(.,df[,1])) %>% head()

Thanks Renger

like image 716
arnyeinstein Avatar asked Dec 03 '22 12:12

arnyeinstein


2 Answers

Here's the simplest way to do it.

df %>% 
   mutate_at(.vars = vars(B:E),
             .funs = list(~ . - A))

The .vars argument lets you specify columns in the same way that you would specify columns in select(), provided you put that specification inside the function vars().

The .funs argument accepts an anonymous function defined on the fly inside a call to list(). And you can reference a column in the dataframe (in this case A) when defining this anonymous function (see this Stackoverflow question).

In addition, with the release of dplyr 1.0.0, you can now simply do the following:

df %>%
   mutate(across(B:E, ~ . - A))
like image 165
bschneidr Avatar answered Dec 22 '22 22:12

bschneidr


Here's a faster solution using base R code. Strategy is convert to a matrix, subtract column one from the required columns, build back into a data frame. Note this only returns the modified columns - if there are columns not in vars_to_process they'll not appear in the output but you didn't have any of those in your test set so I'll assume they don't exist.

So, always write things in functions whenever possible:

bsr = function(df,vars_to_process){
    m = as.matrix(df)
    data.frame(
         A = m[, 1],
             m[, 1] - m[, vars_to_process])}

Make some test data:

> df = data.frame(matrix(runif(5*1000), ncol=5))
> names(df)=LETTERS[1:5]
> dft = as.tibble(df)
> head(dft)
# A tibble: 6 x 5
          A          B         C         D         E
      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>
1 0.2609174 0.07857624 0.2727817 0.8498004 0.3403234
2 0.3644744 0.95810657 0.8183856 0.2958133 0.4752349
3 0.6042914 0.98793218 0.7547003 0.9596591 0.5354045
4 0.4000441 0.61403331 0.9018804 0.3838347 0.3266855
5 0.6767012 0.11984219 0.9181570 0.5988404 0.6058629

Compare with the tidyverse version:

akr = function(df,vars_to_process){
   df %>% mutate_at(vars_to_process, funs(r_diff(.,df[[1]])))
   }

Check bsr and akr agree:

> head(bsr(dft, vars_to_process))
          A          B           C           D           E
1 0.2609174  0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119  0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765  0.06888696
4 0.4000441 -0.2139892 -0.50183635  0.01620939  0.07335861

> head(akr(dft, vars_to_process))
# A tibble: 6 x 5
          A          B           C           D           E
      <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1 0.2609174  0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119  0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765  0.06888696
4 0.4000441 -0.2139892 -0.50183635  0.01620939  0.07335861

okay, except akr returns a tribble but nm. Benchmark:

> microbenchmark(bsr(dft, vars_to_process),akr(dft, vars_to_process))
Unit: microseconds
                      expr      min        lq      mean   median       uq
 bsr(dft, vars_to_process)  362.117  388.7215  488.9309  446.123  521.776
 akr(dft, vars_to_process) 8070.391 8365.4230 9853.5239 8673.692 9335.613

Base R version is 26 times faster. I'd also argue that subtracting a column from another set of columns is tidier than applying a mutator function but as long as you wrap what your doing in a function it doesn't matter how messy the guts are.

like image 39
Spacedman Avatar answered Dec 22 '22 23:12

Spacedman