Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: efficiently grep characters in rows of large data.frame

Tags:

dataframe

r

I have a data frame of character strings that is >1M rows long:

>head(df)
     A    B     C     D
1   S1   S2    U1    U2
2   S1   S2    S2    S1
3   S2   S1    S1    S2
4   S1   M2    U1    S2
5   S1   S1    M2    M1
6   M2   M2    M1    M2

I would like to identify all rows where a particular character is present (e.g., "U"). The solutions I have found so far are working, but they are very slow, for example:

matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })

Any idea how to improve this query? Thanks!

like image 776
jul635 Avatar asked Sep 04 '14 14:09

jul635


3 Answers

EDIT: updates to address comments:

The following is also very fast (0.31 seconds, even faster than before):

rows <- which(
  rowSums(
    `dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
  ) > 0
)

And produces the same result as previous answers. Using fixed=FALSE about doubles the time, but your example doesn't require that.

What we're doing here is cheating by applying grepl to a matrix, though really what we care about is turning df into a vector (which a matrix is), and as.matrix is one of the faster ways to do this. Then we can just run one grepl command. Finally, we use dim<- to turn the grepl vector result back into a matrix, and use rowSums to check which rows had matches.

Here are the reasons why this is much faster than your version:

  • We call grepl once, instead of a million times as you do with apply since the function apply applies gets called once for each row; grepl is vectorized which means you want to minimize how many times you call it and take advantage of the vectorization
  • We do the row match counts with rowSums instead of apply; rowSums is a much faster version of apply(x, 1, sum) (see docs for ?rowSums).

PREVIOUS ANSWER:

Here is a relatively straightforward solution that runs in 0.35 seconds on my system for a 1MM row by 4 column data frame:

rows <- which(rowSums(as.matrix(df) == "U") > 0)

To confirm

df[head(rows), ]

produces (every row has a U):

   a b c d
5  F B D U
8  R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H

And the data:

set.seed(1)
df <- as.data.frame(
  `names<-`(
    replicate(4, sample(LETTERS, 1e6, rep=T), simplify=F),
    letters[1:4]
  )
)
like image 118
BrodieG Avatar answered Sep 20 '22 13:09

BrodieG


library(data.table)

df = fread("~/Rscripts/SO.csv")  # fast read
x = df[, lapply(.SD, function(x) x %like% "U")] # fast grep
y = x[, rowSums(x) > 0]
z = df[y,]
like image 33
Henk Avatar answered Sep 18 '22 13:09

Henk


If you're just looking for the row index for characters, maybe try this. It should be quite a bit faster than looping.

unique(row(df)[grep("U", unlist(df))])
# [1] 1 4
like image 31
Rich Scriven Avatar answered Sep 16 '22 13:09

Rich Scriven