
Referring to column names inside dplyr's across()

Tags:

r

dplyr

tidyverse

Is it possible to refer to column names in a lambda function inside across()?

df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))

df %>%
  mutate(across(c(age, sex),
                c(valid = ~ .x %in% allowed_values[[COLNAME]])))

I just came across this question where OP asks about validating columns in a dataframe based on a list of allowed values.

dplyr just gained across() and it seems like a natural choice, but we need the column names to look up the allowed values.

The best I could come up with was a call to imap_dfr, but it is more cumbersome to integrate into an analysis pipeline, because the results need to be re-combined with the original dataframe.

severin asked Jun 02 '20 15:06




2 Answers

The answer is yes, you can refer to column names inside dplyr's across(): use cur_column(), which returns the name of the column currently being processed. Your original attempt was so close! Insert cur_column() into your solution where you want the column name:

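library(tidyverse)

df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))

# cur_column() gives the name of the column across() is currently working on,
# so it can be used to look up the matching entry in allowed_values.
df %>%
  mutate(across(c(age, sex),
                list(valid = ~ .x %in% allowed_values[[cur_column()]])))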

Reference: https://dplyr.tidyverse.org/articles/colwise.html#current-column
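
For anyone who wants to check the result, here is a minimal sketch of the same idea with the new column names spelled out through the .names argument of across(). The name result is only for illustration, and I load dplyr alone on the assumption that the rest of the tidyverse is not needed here.

```r
library(dplyr)

df <- tibble(age = c(12, 45), sex = c("f", "f"))
allowed_values <- list(age = 18:100, sex = c("f", "m"))

# `result` is an illustrative name, not part of the original question.
result <- df %>%
  mutate(across(
    c(age, sex),
    # Look up the allowed values for whichever column is being processed
    ~ .x %in% allowed_values[[cur_column()]],
    # Name the new columns after the source column: age_valid, sex_valid
    .names = "{.col}_valid"
  ))

print(result)
```

The original columns are kept, and age_valid and sex_valid are added next to them, so nothing has to be re-combined afterwards.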

s_pike answered Oct 18 '22 18:10



I think that you may be asking too much of across at this point (but this may spur additional development, so maybe someday it will work the way you suggest).

I think that the imap functions from the purrr package may give you what you want at this point:

> df <- tibble(age = c(12, 45), sex = c('f', 'f'))
> allowed_values <- list(age = 18:100, sex = c("f", "m"))
> 
> df %>% imap( ~ .x %in% allowed_values[[.y]])
$age
[1] FALSE  TRUE

$sex
[1] TRUE TRUE

> df %>% imap_dfc( ~ .x %in% allowed_values[[.y]])
# A tibble: 2 x 2
  age   sex  
  <lgl> <lgl>
1 FALSE TRUE 
2 TRUE  TRUE 

If you want a single column with the combined validity then you can pass the result through reduce:

> df %>% imap( ~ .x %in% allowed_values[[.y]]) %>%
+   reduce(`&`)
[1] FALSE  TRUE

This could then be added as a new column to the original data, or just used for subsetting the data. I am not expert enough with the tidyverse yet to know if this could be combined with mutate to add the columns directly.
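
A rough sketch of those two options, assuming the same df and allowed_values as above. The names row_ok, df_flagged and df_clean are made up for illustration.

```r
library(dplyr)
library(purrr)

df <- tibble(age = c(12, 45), sex = c("f", "f"))
allowed_values <- list(age = 18:100, sex = c("f", "m"))

# One logical per row: TRUE only when every column holds an allowed value.
# `row_ok` is an illustrative name.
row_ok <- df %>%
  imap(~ .x %in% allowed_values[[.y]]) %>%
  reduce(`&`)

# Option 1: add the combined validity as a new column
df_flagged <- df %>% mutate(all_valid = row_ok)

# Option 2: keep only the rows where every column is valid
df_clean <- df[row_ok, ]

print(df_flagged)
print(df_clean)
```

Computing the logical vector first and attaching it afterwards keeps the imap() step separate from the data frame, which is the re-combining step the question describes as cumbersome.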

Greg Snow answered Oct 18 '22 19:10
