I have a dataframe where the variables are character-strings. How can I extract only those columns where at least one value matches a particular string? For example, in the dataframe below, I want a match of the string "AB", i.e. I want to subset out another dataframe containing the columns V1, V2 and V5.
V1 V2 V3 V4 V5
ABCD ABEF EFGJ AFASD JLKJLXKJ
LKJAF ROGIJ GREJWI SDFS ABKLJKJX
AFSD JLASDF JKLJ OIJPOI AFSD
First you can apply grepl
with required pattern to each column:
> sapply(data, function (x) grepl('AB', x))
V1 V2 V3 V4 V5
[1,] TRUE TRUE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE TRUE
[3,] FALSE FALSE FALSE FALSE FALSE
You can simplify above result by wrapping grepl
call with any
> sapply(data, function (x) any(grepl('AB', x)))
V1 V2 V3 V4 V5
TRUE TRUE FALSE FALSE TRUE
With vector like this you can easily extract required columns:
data[, sapply(data, function (x) any(grepl('AB', x)))]
And the result is:
V1 V2 V5
1 ABCD ABEF JLKJLXKJ
2 LKJAF ROGIJ ABKLJKJX
3 AFSD JLASDF AFSD
At this point, my answer doesn't add much, but I was on my phone when I posted the comment, so I didn't feel comfortable posting an actual answer.
Anyway, here's what I would have suggested. It's pretty much the same concept as @zero323's answer, but uses sapply
or vapply
instead of apply
, since those are likely to be more efficient on columns of a data.frame
:
mydf[vapply(mydf, function(x) any(grepl("AB", x)), vector(length = 1))]
or
mydf[sapply(mydf, function(x) any(grepl("AB", x)))]
To show the speed difference, let's try it on a larger data.frame
, this one being 500 rows by 500 columns.
library(microbenchmark)
fun1a <- function() mydf[vapply(mydf, function(x) any(grepl("AB", x)), vector(length = 1))]
fun1b <- function() mydf[sapply(mydf, function(x) any(grepl("AB", x)))]
fun2 <- function() mydf[, apply(mydf, 2, function (x) any(grepl('AB', x)))]
set.seed(1)
nrow <- 500
ncol <- 500
x <- sample(8, nrow*ncol, replace = TRUE)
y <- lapply(x, function(z) paste(sample(LETTERS, z, replace = TRUE), collapse = ""))
mydf <- data.frame(matrix(unlist(y, use.names = FALSE), nrow = nrow))
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1a() 75.46204 82.84732 101.22437 115.8292 120.5349 10
# fun1b() 75.92004 85.82025 99.31647 108.5303 310.0216 10
# fun2() 134.82356 168.44435 182.88842 196.4751 207.9986 10
identical(fun1a(), fun2())
# [1] TRUE
identical(fun1b(), fun2())
# [1] TRUE
vapply
usually gives a bit of a speed boost, but in this case, it doesn't seem to.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With