R: efficiently grep characters in rows of large data.frame

Question

I have a data frame of character strings that is >1M rows long:

>head(df)
     A    B     C     D
1   S1   S2    U1    U2
2   S1   S2    S2    S1
3   S2   S1    S1    S2
4   S1   M2    U1    S2
5   S1   S1    M2    M1
6   M2   M2    M1    M2

I would like to identify all rows where a particular character is present (e.g., "U"). The solutions I have found so far are working, but they are very slow, for example:

matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })

Any idea how to improve this query? Thanks!

BrodieG · Accepted Answer

EDIT: updates to address comments:

The following is also very fast (0.31 seconds, even faster than before):

rows <- which(
  rowSums(
    `dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
  ) > 0
)

And produces the same result as previous answers. Using fixed=FALSE about doubles the time, but your example doesn't require that.

What we're doing here is cheating by applying grepl to a matrix, though really what we care about is turning df into a vector (which a matrix is), and as.matrix is one of the faster ways to do this. Then we can just run one grepl command. Finally, we use dim<- to turn the grepl vector result back into a matrix, and use rowSums to check which rows had matches.

Here are the reasons why this is much faster than your version:

We call grepl once, instead of a million times as you do with apply since the function apply applies gets called once for each row; grepl is vectorized which means you want to minimize how many times you call it and take advantage of the vectorization
We do the row match counts with rowSums instead of apply; rowSums is a much faster version of apply(x, 1, sum) (see docs for ?rowSums).

PREVIOUS ANSWER:

Here is a relatively straightforward solution that runs in 0.35 seconds on my system for a 1MM row by 4 column data frame:

rows <- which(rowSums(as.matrix(df) == "U") > 0)

To confirm

df[head(rows), ]

produces (every row has a U):

   a b c d
5  F B D U
8  R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H

And the data:

set.seed(1)
df <- as.data.frame(
  `names<-`(
    replicate(4, sample(LETTERS, 1e6, rep=T), simplify=F),
    letters[1:4]
  )
)

Henk · Answer

library(data.table)

df = fread("~/Rscripts/SO.csv")  # fast read
x = df[, lapply(.SD, function(x) x %like% "U")] # fast grep
y = x[, rowSums(x) > 0]
z = df[y,]

Rich Scriven · Answer

If you're just looking for the row index for characters, maybe try this. It should be quite a bit faster than looping.

unique(row(df)[grep("U", unlist(df))])
# [1] 1 4

R: efficiently grep characters in rows of large data.frame

Tags:

dataframe

r

jul635

3 Answers

BrodieG

Henk

Rich Scriven

Recent Activity

Donate For Us

R: efficiently grep characters in rows of large data.frame

Tags:

dataframe

r

jul635

3 Answers

BrodieG

Henk

Rich Scriven

Related questions

Recent Activity

Donate For Us