I have a data frame with 5 columns and I want to produce 4 additional columns giving my the difference between the last 4 columns and the first column. I tried the following, but that doesn't work:
library(tidyverse)
df <- as.tibble(data.frame(A = c(1,2), B = c(3,4), C = c(4,5), D = c(2,3), E = c(4,5)))
r_diff <- function(x,y){
z = y - x
return(z)
}
vars_to_process <- c("B","C","D","E")
df %>% mutate_at(.cols=vars_to_process, .funs =r_diff(.,df[,1])) %>% head()
Thanks Renger
Here's the simplest way to do it.
df %>%
mutate_at(.vars = vars(B:E),
.funs = list(~ . - A))
The .vars
argument lets you specify columns in the same way that you would specify columns in select()
, provided you put that specification inside the function vars()
.
The .funs
argument accepts an anonymous function defined on the fly inside a call to list()
. And you can reference a column in the dataframe (in this case A
) when defining this anonymous function (see this Stackoverflow question).
In addition, with the release of dplyr 1.0.0, you can now simply do the following:
df %>%
mutate(across(B:E, ~ . - A))
Here's a faster solution using base R code. Strategy is convert to a matrix, subtract column one from the required columns, build back into a data frame. Note this only returns the modified columns - if there are columns not in vars_to_process
they'll not appear in the output but you didn't have any of those in your test set so I'll assume they don't exist.
So, always write things in functions whenever possible:
bsr = function(df,vars_to_process){
m = as.matrix(df)
data.frame(
A = m[, 1],
m[, 1] - m[, vars_to_process])}
Make some test data:
> df = data.frame(matrix(runif(5*1000), ncol=5))
> names(df)=LETTERS[1:5]
> dft = as.tibble(df)
> head(dft)
# A tibble: 6 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.2609174 0.07857624 0.2727817 0.8498004 0.3403234
2 0.3644744 0.95810657 0.8183856 0.2958133 0.4752349
3 0.6042914 0.98793218 0.7547003 0.9596591 0.5354045
4 0.4000441 0.61403331 0.9018804 0.3838347 0.3266855
5 0.6767012 0.11984219 0.9181570 0.5988404 0.6058629
Compare with the tidyverse version:
akr = function(df,vars_to_process){
df %>% mutate_at(vars_to_process, funs(r_diff(.,df[[1]])))
}
Check bsr
and akr
agree:
> head(bsr(dft, vars_to_process))
A B C D E
1 0.2609174 0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119 0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765 0.06888696
4 0.4000441 -0.2139892 -0.50183635 0.01620939 0.07335861
> head(akr(dft, vars_to_process))
# A tibble: 6 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.2609174 0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119 0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765 0.06888696
4 0.4000441 -0.2139892 -0.50183635 0.01620939 0.07335861
okay, except akr
returns a tribble but nm. Benchmark:
> microbenchmark(bsr(dft, vars_to_process),akr(dft, vars_to_process))
Unit: microseconds
expr min lq mean median uq
bsr(dft, vars_to_process) 362.117 388.7215 488.9309 446.123 521.776
akr(dft, vars_to_process) 8070.391 8365.4230 9853.5239 8673.692 9335.613
Base R version is 26 times faster. I'd also argue that subtracting a column from another set of columns is tidier than applying a mutator function but as long as you wrap what your doing in a function it doesn't matter how messy the guts are.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With