I have 18 pairs of variable and I would like to do pair-wise math on them to calculate 18 new variables. The across() function in dplyr is quite handy when applying a formula to one column. Is there a way to apply across() to pairs of columns?
Tiny example with simple division of 2 variables (my actual code will be more complex, some ifelse, ...):
library(tidyverse)
library(glue)
# filler data
df <- data.frame("label" = c('a','b','c','d'),
"A" = c(4, 3, 8, 9),
"B" = c(10, 0, 4, 1),
"error_A" = c(0.4, 0.3, 0.2, 0.1),
"error_B" = c(0.3, 0, 0.4, 0.1))
# what I want to have in the end
# instead of just 2 (A, B), I have 18
df1 <- df %>% mutate(
'R_A' = A/error_A,
'R_B' = B/error_B
)
# what I'm thinking about doing to use both variables A and error_A to calculate the new column
df2 <- df %>% mutate(
across(c('A','B'),
~.x/{HOW DO I USE THE COLUMN WHOSE NAME IS glue('error_',.x)}
.names = 'R_{.col}'
)
To pick out single or multiple columns use the select() function. The select() function expects a dataframe as it's first input ('argument', in R language), followed by the names of the columns you want to extract with a comma between each name.
Similarly to readr , dplyr and tidyr are also part of the tidyverse. These packages were loaded in R's memory when we called library(tidyverse) earlier.
Add a column to a dataframe in R using dplyr. In my opinion, the best way to add a column to a dataframe in R is with the mutate() function from dplyr . mutate() , like all of the functions from dplyr is easy to use.
You can use the mutate() function from the dplyr package to add one or more columns to a data frame in R.
The first argument here in across () is the columns to operate on. That takes any dplyr::select () syntax. I also could have given a range of columns instead of is.numeric, such as
The first argument here in across () is the columns to operate on. That takes any dplyr::select () syntax. I also could have given a range of columns instead of is.numeric, such as Or, I could use starts_with ().
To that end, filter () has two special purpose companion functions: if_any () keeps the rows where the predicate is true for at least one selected column: Prior versions of dplyr allowed you to apply a function to multiple columns in a different way: using functions with _if, _at, and _all () suffixes.
Update: as of June 1, dplyr 1.0.0 is now available on CRAN! Read all about it or install it now with install.packages ("dplyr"). It’s often useful to perform the same operation on multiple columns, but copying and pasting is both tedious and error prone:
One option is map/reduce
. Specify the columns of interest ('nm1'), loop over them in map
, select
those columns from the dataset, reduce
by dividing, rename
the columns after column binding (_dfc
), and bind those with the original dataset
library(dplyr)
library(purrr)
library(stringr)
nm1 <- c('A', 'B')
map_dfc(nm1, ~ df %>%
select(ends_with(.x)) %>%
reduce(., `/`) ) %>%
rename_all(~ str_c('R_', nm1)) %>%
bind_cols(df, .)
-output
# label A B error_A error_B R_A R_B
#1 a 4 10 0.4 0.3 10 33.33333
#2 b 3 0 0.3 0.0 10 NaN
#3 c 8 4 0.2 0.4 40 10.00000
#4 d 9 1 0.1 0.1 90 10.00000
Or another option with across
df %>%
mutate(across(c(A, B), ~
./get(str_c('error_', cur_column() )), .names = 'R_{.col}' ))
# label A B error_A error_B R_A R_B
#1 a 4 10 0.4 0.3 10 33.33333
#2 b 3 0 0.3 0.0 10 NaN
#3 c 8 4 0.2 0.4 40 10.00000
#4 d 9 1 0.1 0.1 90 10.00000
One option could be:
df %>%
mutate(across(c(A, B), .names = "R_{col}")/across(starts_with("error")))
label A B error_A error_B R_A R_B
1 a 4 10 0.4 0.3 10 33.33333
2 b 3 0 0.3 0.0 10 NaN
3 c 8 4 0.2 0.4 40 10.00000
4 d 9 1 0.1 0.1 90 10.00000
I like akruns answer above, especially the approach with cur_column()
. Interestingly, cur_column()
can't be used with {rlang}'s evaluation (!! sym(paste0("error_", cur_column()))
), but get
is a nice workaround.
Just to add one more approach, which also works under dpylr < 1.0.0. I usually use a mutate
custom function together with purrr::reduce()
. In this function x
is your string stem and you construct all variables you want to access with !! sym(paste0(...))
. On the left hand-side you can just use {rlang}'s glue syntax.
You apply this custom function by calling reduce()
on the vector of strings and your data.frame
goes in the .init = .
argument.
library(tidyverse)
library(glue)
# filler data
df <- data.frame("label" = c('a','b','c','d'),
"A" = c(4, 3, 8, 9),
"B" = c(10, 0, 4, 1),
"error_A" = c(0.4, 0.3, 0.2, 0.1),
"error_B" = c(0.3, 0, 0.4, 0.1))
gen_vars1 <- function(df, x) {
mutate(df,
"R_{x}" := !! sym(x) / !! sym(paste0("error_", x)))
}
df %>%
reduce(c("A", "B"), gen_vars1, .init = .)
#> label A B error_A error_B R_A R_B
#> 1 a 4 10 0.4 0.3 10 33.33333
#> 2 b 3 0 0.3 0.0 10 NaN
#> 3 c 8 4 0.2 0.4 40 10.00000
#> 4 d 9 1 0.1 0.1 90 10.00000
Created on 2021-01-02 by the reprex package (v0.3.0)
I once opened a feature request for this kind of problem, but apparently it is too special case for {dplyr}. When you follow the link you can also find another option to do this kind of operation.
For such cases I find base R solution straight-forward and efficient too. It doesn't require to loop over columns or unique values. You define two groups of column and divide them directly.
For the example that you have shared we can identify "A"
and "B"
columns by looking for columns names that have only one character in them.
cols <- grep('^.$', names(df), value = TRUE)
error_cols <- grep('error', names(df), value = TRUE)
df[paste0('R_', cols)] <- df[cols]/df[error_cols]
df
# label A B error_A error_B R_A R_B
#1 a 4 10 0.4 0.3 10 33.3
#2 b 3 0 0.3 0.0 10 NaN
#3 c 8 4 0.2 0.4 40 10.0
#4 d 9 1 0.1 0.1 90 10.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With