I am subsetting a dataframe according to multiple criteria across several columns. I am choosing the rows in the dataframe that contain any one of several values defined in the vector "criteria" in any one of three different columns.
I have some code that works, but wonder what other (more elegant?) ways there are to do this. Here is what I've done:
criteria <-c(1:10)
subset1 <-subset(data, data[, "Col1"] %in% criteria | data[, "Col2"]
%in% criteria | data[, "Col3"] %in% criteria)
Suggestions warmly welcomed. (I am an R beginner, so very simple explanations about what you are suggesting are also warmly welcomed.)
I'm not sure if you need two apply
calls here:
# Data
df=data.frame(x=1:4,Col1=c(11,12,3,13),Col2=c(9,12,10,13),Col3=c(9,13,42,23))
criteria=1:10
# Solution
df[apply(df [c('Col1','Col2','Col3')],1,function(x) any(x %in% criteria)),]
Unless you want to do a lot of columns, then it is probably more readable to say:
subset(df, Col1 %in% criteria | Col2 %in% criteria | Col3 %in% criteria)
I'm using DF
rather than data
as the example.
DF[apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]),
c(1,2), `%in%`, criteria),
1, any),]
For a breakdown of what this is doing:
Make a matrix of the specified columns, and for each element in that matrix test if it contains one of the criteria. Then for each row of that matrix, see if any of the row elements are TRUE
. If so, keep the corresponding row of the original dataset.
Working through an example:
Start with dummy data:
DF <- data.frame(Col1=seq(1, by=2, length=10),
Col2=seq(3, by=3, length=10),
Col3=seq(7, by=1, length=10),
other=LETTERS[1:10])
which looks like
> DF
Col1 Col2 Col3 other
1 1 3 7 A
2 3 6 8 B
3 5 9 9 C
4 7 12 10 D
5 9 15 11 E
6 11 18 12 F
7 13 21 13 G
8 15 24 14 H
9 17 27 15 I
10 19 30 16 J
Pull out just the columns of interest.
> as.matrix(DF[c("Col1","Col2","Col3")])
Col1 Col2 Col3
[1,] 1 3 7
[2,] 3 6 8
[3,] 5 9 9
[4,] 7 12 10
[5,] 9 15 11
[6,] 11 18 12
[7,] 13 21 13
[8,] 15 24 14
[9,] 17 27 15
[10,] 19 30 16
Check each entry versus the criteria
> apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria)
Col1 Col2 Col3
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE TRUE
[3,] TRUE TRUE TRUE
[4,] TRUE FALSE TRUE
[5,] TRUE FALSE FALSE
[6,] FALSE FALSE FALSE
[7,] FALSE FALSE FALSE
[8,] FALSE FALSE FALSE
[9,] FALSE FALSE FALSE
[10,] FALSE FALSE FALSE
Test if any of the values in a row are TRUE
> apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria), 1, any)
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
Use that to index the original data frame.
> DF[apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria), 1, any),]
Col1 Col2 Col3 other
1 1 3 7 A
2 3 6 8 B
3 5 9 9 C
4 7 12 10 D
5 9 15 11 E
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With