Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Extract columns of a dataframe through string matching

Tags:

r

match

subset

I have a dataframe where the variables are character-strings. How can I extract only those columns where at least one value matches a particular string? For example, in the dataframe below, I want a match of the string "AB", i.e. I want to subset out another dataframe containing the columns V1, V2 and V5.

V1      V2      V3      V4      V5
ABCD    ABEF    EFGJ    AFASD   JLKJLXKJ
LKJAF   ROGIJ   GREJWI  SDFS    ABKLJKJX
AFSD    JLASDF  JKLJ    OIJPOI  AFSD
like image 294
user702432 Avatar asked Dec 26 '22 16:12

user702432


2 Answers

First you can apply grepl with required pattern to each column:

> sapply(data, function (x) grepl('AB', x))
        V1    V2    V3    V4    V5
[1,]  TRUE  TRUE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE  TRUE
[3,] FALSE FALSE FALSE FALSE FALSE

You can simplify above result by wrapping grepl call with any

> sapply(data, function (x) any(grepl('AB', x)))
   V1    V2    V3    V4    V5 
TRUE  TRUE FALSE FALSE  TRUE 

With vector like this you can easily extract required columns:

data[, sapply(data, function (x) any(grepl('AB', x)))]

And the result is:

     V1     V2       V5
1  ABCD   ABEF JLKJLXKJ
2 LKJAF  ROGIJ ABKLJKJX
3  AFSD JLASDF     AFSD
like image 93
zero323 Avatar answered Dec 28 '22 08:12

zero323


At this point, my answer doesn't add much, but I was on my phone when I posted the comment, so I didn't feel comfortable posting an actual answer.

Anyway, here's what I would have suggested. It's pretty much the same concept as @zero323's answer, but uses sapply or vapply instead of apply, since those are likely to be more efficient on columns of a data.frame:

mydf[vapply(mydf, function(x) any(grepl("AB", x)), vector(length = 1))]

or

mydf[sapply(mydf, function(x) any(grepl("AB", x)))]

To show the speed difference, let's try it on a larger data.frame, this one being 500 rows by 500 columns.

library(microbenchmark)
fun1a <- function() mydf[vapply(mydf, function(x) any(grepl("AB", x)), vector(length = 1))]
fun1b <- function() mydf[sapply(mydf, function(x) any(grepl("AB", x)))]
fun2 <- function() mydf[, apply(mydf, 2, function (x) any(grepl('AB', x)))]

set.seed(1)
nrow <- 500
ncol <- 500
x <- sample(8, nrow*ncol, replace = TRUE)
y <- lapply(x, function(z) paste(sample(LETTERS, z, replace = TRUE), collapse = ""))
mydf <- data.frame(matrix(unlist(y, use.names = FALSE), nrow = nrow))

microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
#     expr       min        lq    median       uq      max neval
#  fun1a()  75.46204  82.84732 101.22437 115.8292 120.5349    10
#  fun1b()  75.92004  85.82025  99.31647 108.5303 310.0216    10
#   fun2() 134.82356 168.44435 182.88842 196.4751 207.9986    10
identical(fun1a(), fun2())
# [1] TRUE
identical(fun1b(), fun2())
# [1] TRUE

vapply usually gives a bit of a speed boost, but in this case, it doesn't seem to.

like image 23
A5C1D2H2I1M1N2O1R2T1 Avatar answered Dec 28 '22 10:12

A5C1D2H2I1M1N2O1R2T1