I have a vector v where each entry is one or more strings (or possibly character(0)) separated by semicolons:
ABC
DEF;ABC;QWE
TRF
character(0)
ABC;GFD
I need to find the indices of the entries that contain "ABC" after splitting on ";" (here 1,2,5, or equivalently the logical vector T,T,F,F,T).
I am currently using a loop as follows:
toSelect=integer(0)
for(i in c(1:length(v))){
if(length(v[i])==0) next
words=strsplit(v[i],";")[[1]]
if(!is.na(match("ABC",words))) toSelect=c(toSelect,i)
}
Unfortunately, my vector has 450k entries, so this takes far too long. I would prefer to create a logical vector by doing something like
toSelect=(!is.na(match("ABC",strsplit(v,";"))))
But since strsplit returns a list, I can't find a way to properly format strsplit(v,";") as a vector (unlist won't do since it would ruin the indices). Does anybody have any ideas on how to speed up this code?
Thanks!
Use regular expressions:
v = list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
grep("(^|;)ABC($|;)", v)
#[1] 1 2 5
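As a rough sketch of the same idea for the logical-vector form the question asks for, grepl can be used in place of grep. The names pattern, flat and toSelect are illustrative only, and mapping character(0) to NA is an assumption made here so that the result does not depend on how a list is coerced to character:

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")

# "ABC" must be bounded by the start/end of the string or by a semicolon
pattern <- "(^|;)ABC($|;)"

# Assumption: represent empty elements as NA so v becomes a plain character vector
flat <- vapply(v, function(x) if (length(x) == 0) NA_character_ else x,
               character(1))

# grepl returns FALSE for NA input, so empty elements are never selected
toSelect <- grepl(pattern, flat)
which(toSelect)
```

The (^|;) and ($|;) groups are what keep entries such as "XABC" or "ABCD" from being matched as if they were "ABC".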
The tricky part is dealing with character(0), which @BlueMagister fudges by replacing it with character(1) (this allows use of a vector, but doesn't allow representation of the original problem). Perhaps
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
v[sapply(v, length) != 0] <- strsplit(unlist(v), ";", fixed=TRUE)
to do the string split. One might proceed in base R, but I'd recommend the IRanges package
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("IRanges")
to install, then
library(IRanges)
w = CharacterList(v)
which gives a list-like structure where all elements must be character vectors.
> w
CharacterList of length 5
[[1]] ABC
[[2]] DEF ABC QWE
[[3]] TRF
[[4]] character(0)
[[5]] ABC GFD
One can then do fun things like ask "are element members equal to ABC"
> w == "ABC"
LogicalList of length 5
[[1]] TRUE
[[2]] FALSE TRUE FALSE
[[3]] FALSE
[[4]] logical(0)
[[5]] TRUE FALSE
or "are any element members equal to ABC"
> any(w == "ABC")
[1] TRUE TRUE FALSE FALSE TRUE
This will scale very well. For operations not supported "out of the box", a computationally cheap strategy is to unlist, transform the result into a vector of the same length, and then relist using the original CharacterList as a skeleton. For instance, to apply reverse to each member:
> relist(reverse(unlist(w)), w)
CharacterList of length 5
[[1]] CBA
[[2]] FED CBA EWQ
[[3]] FRT
[[4]] character(0)
[[5]] CBA DFG
As @eddi points out, this is slower than grep. The motivation is (a) to avoid needing to formulate complicated regular expressions while (b) gaining flexibility for other operations one might like to do on data structured like this.
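For the base R route mentioned earlier, one possible sketch follows. It is not the answer's own method: the names words, owner, flat, toSelect1 and toSelect2 are hypothetical, and it assumes v is a list as in the example above. The second option flattens the list once while recording which element each word came from, which is one way around the concern that unlist loses the indices:

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")

# Split only the non-empty elements, leaving character(0) in place
words <- v
words[lengths(v) != 0] <- strsplit(unlist(v), ";", fixed = TRUE)

# Option 1: test membership element by element
toSelect1 <- vapply(words, function(x) "ABC" %in% x, logical(1))

# Option 2: flatten once, remembering which element each word came from
owner <- rep(seq_along(words), lengths(words))  # owner[k] indexes into v
flat <- unlist(words)
toSelect2 <- seq_along(words) %in% owner[flat == "ABC"]

which(toSelect2)
```

Both options give the same logical vector on this example. Which one is faster on 450k entries would need to be measured on the real data.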