My dataframe looks like this:
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)
df
x1 x2 x3 x4
1 a b b a
2 c c d e
3 f g h i
4 j k <NA> <NA>
Now I have an arbitrary vector:
vec <- c("a", "i", "s", "t", "z")
I would like to compare the vector values with each row in the data frame and create an additional column that indicates whether at least one (ANY) of the vector values was found or not.
The resulting dataframe should look like this:
x1 x2 x3 x4 valueFound
1 a b b a 1
2 c c d e 0
3 f g h i 1
4 j k <NA> <NA> 0
I would like to do it without looping. Thank you very much for your support!
Rami
Using subset() subset() is also a R base function that also mostly used to select rows from the DataFrame. This function takes the DataFrame object as input and the condition to select rows.
We can compare two columns in R by using ifelse(). This statement is used to check the condition given and return the data accordingly.
rbind() function in R Language is used to combine specified Vector, Matrix or Data Frame by rows. deparse. level: This value determines how the column names generated. The default value of deparse.
By using bracket notation on R DataFrame (data.name) we can select rows by column value, by index, by name, by condition e.t.c. You can also use the R base function subset() to get the same results. Besides these, R also provides another function dplyr::filter() to get the rows from the DataFrame.
This would be faster than an apply
based solution (despite it's cryptic construction):
as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0
Here, we can make up some bigger data to test on.... These benchmarks are on 100k rows.
set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)
Here are the approaches we have so far:
AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
apply(df,1,function(x){
if(any(x %in% vec)){
1
} else {
0
}
})
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
I'm suspecting the NR functions will be a little slower:
system.time(NR1()) # Other NR functions are about the same
# user system elapsed
# 1.172 0.000 1.196
And, similarly, Richard's second approach:
system.time(RS2())
# user system elapsed
# 0.918 0.000 0.932
The grepl
and this rowSum
function are left for the benchmarks:
library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# AM() 65.75296 67.2527 92.03043 84.58111 102.3199 234.6114 100
# RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525 100
Here's one way to do this:
df$valueFound <- apply(df,1,function(x){
if(any(x %in% vec)){
1
} else {
0
}
})
##
> df
x1 x2 x3 x4 valueFound
1 a b b a 1
2 c c d e 0
3 f g h i 1
4 j k <NA> <NA> 0
Thanks to @David Arenburg and @CathG, a couple of more concise approaches:
apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))
Just for fun, a couple of other interesting variants:
apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With