Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Subset multiple columns in R - more elegant code?

Tags:

r

subset

I am subsetting a dataframe according to multiple criteria across several columns. I am choosing the rows in the dataframe that contain any one of several values defined in the vector "criteria" in any one of three different columns.

I have some code that works, but wonder what other (more elegant?) ways there are to do this. Here is what I've done:

criteria <-c(1:10)
subset1 <-subset(data, data[, "Col1"] %in% criteria | data[, "Col2"]
 %in% criteria | data[, "Col3"] %in% criteria)

Suggestions warmly welcomed. (I am an R beginner, so very simple explanations about what you are suggesting are also warmly welcomed.)

like image 716
user1257313 Avatar asked Mar 09 '12 21:03

user1257313


2 Answers

I'm not sure if you need two apply calls here:

# Data
df=data.frame(x=1:4,Col1=c(11,12,3,13),Col2=c(9,12,10,13),Col3=c(9,13,42,23))
criteria=1:10

# Solution
df[apply(df [c('Col1','Col2','Col3')],1,function(x) any(x %in% criteria)),]

Unless you want to do a lot of columns, then it is probably more readable to say:

subset(df, Col1 %in% criteria | Col2 %in% criteria | Col3 %in% criteria) 
like image 186
nograpes Avatar answered Sep 27 '22 22:09

nograpes


I'm using DF rather than data as the example.

DF[apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), 
               c(1,2), `%in%`, criteria), 
         1, any),]

For a breakdown of what this is doing:

Make a matrix of the specified columns, and for each element in that matrix test if it contains one of the criteria. Then for each row of that matrix, see if any of the row elements are TRUE. If so, keep the corresponding row of the original dataset.

Working through an example:

Start with dummy data:

DF <- data.frame(Col1=seq(1, by=2, length=10),
                 Col2=seq(3, by=3, length=10),
                 Col3=seq(7, by=1, length=10),
                 other=LETTERS[1:10])

which looks like

> DF
   Col1 Col2 Col3 other
1     1    3    7     A
2     3    6    8     B
3     5    9    9     C
4     7   12   10     D
5     9   15   11     E
6    11   18   12     F
7    13   21   13     G
8    15   24   14     H
9    17   27   15     I
10   19   30   16     J

Pull out just the columns of interest.

> as.matrix(DF[c("Col1","Col2","Col3")])
      Col1 Col2 Col3
 [1,]    1    3    7
 [2,]    3    6    8
 [3,]    5    9    9
 [4,]    7   12   10
 [5,]    9   15   11
 [6,]   11   18   12
 [7,]   13   21   13
 [8,]   15   24   14
 [9,]   17   27   15
[10,]   19   30   16

Check each entry versus the criteria

> apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria)
       Col1  Col2  Col3
 [1,]  TRUE  TRUE  TRUE
 [2,]  TRUE  TRUE  TRUE
 [3,]  TRUE  TRUE  TRUE
 [4,]  TRUE FALSE  TRUE
 [5,]  TRUE FALSE FALSE
 [6,] FALSE FALSE FALSE
 [7,] FALSE FALSE FALSE
 [8,] FALSE FALSE FALSE
 [9,] FALSE FALSE FALSE
[10,] FALSE FALSE FALSE

Test if any of the values in a row are TRUE

> apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria), 1, any)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Use that to index the original data frame.

> DF[apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria), 1, any),]
  Col1 Col2 Col3 other
1    1    3    7     A
2    3    6    8     B
3    5    9    9     C
4    7   12   10     D
5    9   15   11     E
like image 31
Brian Diggs Avatar answered Sep 27 '22 22:09

Brian Diggs