Detect at least one match between each data frame row and values in vector

Tags: dataframe, r
My dataframe looks like this:

x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

df

  x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>

Now I have an arbitrary vector:

vec <- c("a", "i", "s", "t", "z")

I would like to compare the vector values with each row in the data frame and create an additional column that indicates whether at least one (ANY) of the vector values was found or not.

The resulting dataframe should look like this:

  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

I would like to do it without looping. Thank you very much for your support!

Rami

asked Nov 04 '14 12:11

Rami Al-Fahham

2 Answers

This should be faster than an apply-based solution (despite its cryptic construction):

as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0
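
In case the one-liner is hard to read, here is a step-by-step sketch of what it does, using the toy data from the question. The intermediate names `m` and `hits` are only for illustration and do not appear in the original answer.

```r
# Toy data from the question
df <- data.frame(x1 = c("a", "c", "f", "j"),
                 x2 = c("b", "c", "g", "k"),
                 x3 = c("b", "d", "h", NA),
                 x4 = c("a", "e", "i", NA),
                 stringsAsFactors = FALSE)
vec <- c("a", "i", "s", "t", "z")

m    <- as.matrix(df)   # character matrix with the same shape as df
hits <- m %in% vec      # %in% returns a plain logical vector (no dim attribute)
dim(hits) <- dim(df)    # restore the matrix shape; this is what `dim<-` does inline

# A row counts as "found" when it contains at least one hit
valueFound <- as.numeric(rowSums(hits) >= 1)
valueFound
```

The NA cells simply count as non-matches here because `vec` itself contains no NA.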

Update -- Some benchmarks

Here we make up some bigger data to test on. These benchmarks are on 100k rows and 10 columns.

set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                        nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)

Here are the approaches we have so far:

AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
  apply(df,1,function(x){
    if(any(x %in% vec)){ 
      1 
    } else {
      0
    }
  })
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
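
Before timing anything, it may be worth checking that the approaches agree. Below is a minimal sketch on the question's toy data. The names `found_rowsums`, `found_apply`, `found_grepl`, `df_small` and `d2` are made up for this check, and the functions take the data as arguments so the 100k-row `df` above is not overwritten. One assumption to keep in mind: the `grepl` variant matches patterns inside the pasted row, so it only agrees with `%in%` when every value is a single character, as is the case here.

```r
# Hypothetical argument-taking versions of three of the approaches above
found_rowsums <- function(d, v) as.numeric(rowSums(`dim<-`(as.matrix(d) %in% v, dim(d))) >= 1)
found_apply   <- function(d, v) as.numeric(apply(d, 1, function(x) any(x %in% v)))
found_grepl   <- function(d, v) as.numeric(grepl(paste(v, collapse = "|"), do.call(paste, d)))

df_small <- data.frame(x1 = c("a", "c", "f", "j"),
                       x2 = c("b", "c", "g", "k"),
                       x3 = c("b", "d", "h", NA),
                       x4 = c("a", "e", "i", NA),
                       stringsAsFactors = FALSE)
v <- c("a", "i", "s", "t", "z")

# All three agree on single-character values
stopifnot(identical(found_rowsums(df_small, v), found_apply(df_small, v)),
          identical(found_rowsums(df_small, v), found_grepl(df_small, v)))

# With longer strings, grepl does substring matching and can disagree
d2 <- data.frame(x1 = c("ab", "q"), x2 = c("c", "d"), stringsAsFactors = FALSE)
found_rowsums(d2, v)  # exact matching: "ab" is not an element of v
found_grepl(d2, v)    # pattern matching: "ab" contains "a"
```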

I suspect the NR functions (the apply-based ones) will be a little slower:

system.time(NR1()) # Other NR functions are about the same
#    user  system elapsed 
#   1.172   0.000   1.196

And, similarly, Richard's second approach:

system.time(RS2())
#    user  system elapsed 
#   0.918   0.000   0.932

That leaves the grepl approach (RS1) and the rowSums approach (AM) for the microbenchmark:

library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
#   expr       min       lq      mean    median       uq      max neval
#   AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
#  RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100

answered Sep 30 '22 21:09

A5C1D2H2I1M1N2O1R2T1

Here's one way to do this:

df$valueFound <- apply(df,1,function(x){
  if(any(x %in% vec)){ 
    1 
  } else {
    0
  }
})
##
> df
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

Thanks to @David Arenburg and @CathG, a couple of more concise approaches:

apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))

Just for fun, a couple of other interesting variants:

apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))
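
All four variants rely on the same thing: R coerces a logical to a number (TRUE to 1, FALSE to 0) as soon as it is used in arithmetic. A small sketch, with `to_number` as a made-up helper name, shows that they give the same values for a single TRUE or FALSE:

```r
# Hypothetical helper: apply each conversion trick to one logical value
to_number <- function(hit) {
  c(plus_zero  = hit + 0,          # addition coerces the logical to a number
    as_numeric = as.numeric(hit),  # explicit conversion
    int_divide = hit %/% TRUE,     # integer division by TRUE (i.e. by 1)
    cum_prod   = cumprod(hit))     # cumulative product of a length-one vector
}

res_true  <- to_number(TRUE)
res_false <- to_number(FALSE)
print(res_true)
print(res_false)
```

For readability, `as.numeric()` (or `as.integer()`) is probably the clearest choice; the others are more of a curiosity.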

answered Sep 30 '22 21:09

nrussell
