Selecting columns based on row values in multiple columns using dplyr

Tags: r, dplyr

I am trying to select columns where at least one row equals 1, only if the same row also has a certain value in a second column. I would prefer to achieve this using dplyr, but any computationally efficient solution is welcome.

Example:

Select columns among a1, a2, a3 containing at least one row where the value is 1 AND where column b=="B"

Example data:

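# Note: sample() changed its default algorithm in R 3.6.0, so newer R versions
# may produce values that differ from the table shown under "Input data".
rand <- function(S) {set.seed(S); sample(x = c(0,1), size = 3, replace = TRUE)}
df <- data.frame(a1=rand(1),a2=rand(2),a3=rand(3),b=c("A","B","A"))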

Input data:

  a1 a2 a3 b
1  0  0  0 A
2  0  1  1 B
3  1  1  0 A

Desired output:

  a2 a3
1  0  0
2  1  1
3  1  0

I managed to obtain the correct output with the following code. However, it is very inefficient, and I need to run it on a very large data frame (365,000 rows x 314 columns).

df %>% select_if(function(x) any(paste0(x,.$b) == '1B'))
Asked by cmdoret on Dec 06 '17 at 08:12

2 Answers

A solution, not using dplyr:

df[sapply(df[df$b == "B",], function(x) 1 %in% x)]
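
To make the mechanics explicit, here is a minimal sketch with the example data written out by hand, so the result does not depend on the random number generator. The names `keep` and `result` are introduced here for illustration only. The rows are filtered to `b == "B"` first, so each column is scanned only over the matching rows, and the `b` column itself is dropped because it never contains 1.

```r
# Example data typed in explicitly (same values as the "Input data" table)
df <- data.frame(a1 = c(0, 0, 1), a2 = c(0, 1, 1), a3 = c(0, 1, 0),
                 b = c("A", "B", "A"))

# Keep only the rows where b == "B", then flag every column that contains a 1.
# sapply() returns one TRUE/FALSE per column, including b (always FALSE here).
keep <- sapply(df[df$b == "B", ], function(x) 1 %in% x)

# Indexing a data frame with a logical vector selects columns
result <- df[keep]
print(result)
```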
Answered by jlesuffleur on Nov 05 '22 at 21:11

Here is my dplyr solution:

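library(dplyr)

# Reshape to long format, keep the (column, row) pairs that match,
# and reduce them to one entry per column name
ids <- df %>% 
  reshape2::melt(id.vars = "b") %>% 
  filter(value == 1 & b == "B") %>% 
  distinct(variable)

# Index by name: a factor index would be treated as integer codes
df[, as.character(ids$variable), drop = FALSE]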

#  a2 a3
#1  0  0
#2  1  1
#3  1  0

As suggested by @docendo-discimus, it is easier to convert the data to long format first.
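
For comparison, here is a sketch that stays in dplyr without reshaping. It assumes dplyr 1.0.0 or later, where `where()` can be used inside `select()`. The names `is_b` and `result` are hypothetical and introduced here for illustration. The row mask is computed once and reused for every column, which avoids the per-column string concatenation of the `paste0()` approach. Whether it is fast enough on a 365,000 x 314 data frame has not been measured here.

```r
library(dplyr)

# Example data typed in explicitly (same values as the "Input data" table)
df <- data.frame(a1 = c(0, 0, 1), a2 = c(0, 1, 1), a3 = c(0, 1, 0),
                 b = c("A", "B", "A"))

# Compute the row mask once instead of once per column
is_b <- df$b == "B"

# Keep numeric columns that have at least one 1 among the rows where b == "B".
# na.rm = TRUE is an added assumption so that missing values do not yield NA.
result <- df %>%
  select(where(function(x) is.numeric(x) && any(x[is_b] == 1, na.rm = TRUE)))

print(result)
```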

Answered by J_F on Nov 05 '22 at 23:11