I have a data frame of character strings that is >1M rows long:
>head(df)
A B C D
1 S1 S2 U1 U2
2 S1 S2 S2 S1
3 S2 S1 S1 S2
4 S1 M2 U1 S2
5 S1 S1 M2 M1
6 M2 M2 M1 M2
I would like to identify all rows where a particular character is present (e.g., "U"). The solutions I have found so far are working, but they are very slow, for example:
matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })
Any idea how to improve this query? Thanks!
EDIT: updates to address comments:
The following is also very fast (0.31 seconds, even faster than before):
rows <- which(
rowSums(
`dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
) > 0
)
And produces the same result as previous answers. Using fixed=FALSE
about doubles the time, but your example doesn't require that.
What we're doing here is cheating by applying grepl
to a matrix, though really what we care about is turning df
into a vector (which a matrix is), and as.matrix
is one of the faster ways to do this. Then we can just run one grepl
command. Finally, we use dim<-
to turn the grepl
vector result back into a matrix, and use rowSums
to check which rows had matches.
Here are the reasons why this is much faster than your version:
grepl
once, instead of a million times as you do with apply
since the function apply
applies gets called once for each row; grepl
is vectorized which means you want to minimize how many times you call it and take advantage of the vectorizationrowSums
instead of apply
; rowSums
is a much faster version of apply(x, 1, sum)
(see docs for ?rowSums
).PREVIOUS ANSWER:
Here is a relatively straightforward solution that runs in 0.35 seconds on my system for a 1MM row by 4 column data frame:
rows <- which(rowSums(as.matrix(df) == "U") > 0)
To confirm
df[head(rows), ]
produces (every row has a U):
a b c d
5 F B D U
8 R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H
And the data:
set.seed(1)
df <- as.data.frame(
`names<-`(
replicate(4, sample(LETTERS, 1e6, rep=T), simplify=F),
letters[1:4]
)
)
library(data.table)
df = fread("~/Rscripts/SO.csv") # fast read
x = df[, lapply(.SD, function(x) x %like% "U")] # fast grep
y = x[, rowSums(x) > 0]
z = df[y,]
If you're just looking for the row index for characters, maybe try this. It should be quite a bit faster than looping.
unique(row(df)[grep("U", unlist(df))])
# [1] 1 4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With