
Avoiding a loop on a strsplit list

Tags:

r

I have a vector v where each entry is one or more strings (or possibly character(0)) separated by semicolons:

ABC

DEF;ABC;QWE

TRF

character(0)

ABC;GFD

I need to find the indices of the entries that contain "ABC" after splitting on ";" (here 1, 2, 5, or equivalently the logical vector T, T, F, F, T).

I am currently using a loop as follows:

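toSelect = integer(0)
for (i in seq_along(v)) {
  # v[[i]] extracts the entry itself; v[i] would be a length-1 list
  if (length(v[[i]]) == 0) next
  words = strsplit(v[[i]], ";")[[1]]
  # growing toSelect with c() on every hit is part of why this is slow
  if (!is.na(match("ABC", words))) toSelect = c(toSelect, i)
}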

Unfortunately, my vector has 450k entries, so this takes far too long. I would prefer to create a logical vector by doing something like

toSelect = !is.na(match("ABC", strsplit(v, ";")))

But since strsplit returns a list, I can't find a way to properly format strsplit(v,";") as a vector (unlist won't do since it would ruin the indices). Does anybody have any ideas on how to speed up this code?

Thanks!

user2359686 asked Feb 10 '26 20:02


2 Answers

Use regular expressions:

v = list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
grep("(^|;)ABC($|;)", v)
#[1] 1 2 5
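
The pattern works because (^|;) and ($|;) require "ABC" to be bounded by a semicolon or by the start or end of the string, so only whole fields match. The following is a small sketch of that behaviour, not part of the original answer: the `tricky` vector is a made-up example, and `grepl()` is used in place of `grep()` to get the logical vector the question also mentions. Note that `grep()` and `grepl()` coerce the list with `as.character()`, which is why the character(0) entry simply fails to match.

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")

# grepl() returns one TRUE/FALSE per entry instead of the matching indices
hit <- grepl("(^|;)ABC($|;)", v)

# Hypothetical entries where "ABC" appears only as part of a longer field
tricky <- c("XABC;DEF", "ABCD", "DEF;ABC")

# Anchored pattern: only whole fields equal to "ABC" count
anchored <- grepl("(^|;)ABC($|;)", tricky)

# Plain substring search for comparison: also hits "XABC" and "ABCD"
unanchored <- grepl("ABC", tricky, fixed = TRUE)
```

Here `hit` marks entries 1, 2 and 5, while `anchored` accepts only the third of the tricky entries and `unanchored` accepts all three.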
eddi answered Feb 12 '26 16:02


The tricky part is dealing with character(0), which @BlueMagister fudges by replacing it with character(1) (this allows use of a vector, but doesn't allow representation of the original problem). Perhaps

v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
v[sapply(v, length) != 0] <- strsplit(unlist(v), ";", fixed=TRUE)

to do the string split. One might proceed in base R, but I'd recommend the IRanges package

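# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")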

to install, then

library(IRanges)
w = CharacterList(v)

which gives a list-like structure where all elements must be character vectors.

> w
CharacterList of length 5
[[1]] ABC
[[2]] DEF ABC QWE
[[3]] TRF
[[4]] character(0)
[[5]] ABC GFD

One can then do fun things like ask "are element members equal to ABC"

> w == "ABC"
LogicalList of length 5
[[1]] TRUE
[[2]] FALSE TRUE FALSE
[[3]] FALSE
[[4]] logical(0)
[[5]] TRUE FALSE

or "are any element members equal to ABC"

> any(w == "ABC")
[1]  TRUE  TRUE FALSE FALSE  TRUE
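
For readers who want to stay in base R, the same two questions can be asked of an ordinary list. This is only a sketch of the base R route mentioned earlier, assuming the entries have already been split; it uses plain `lapply()` and `vapply()` and will not match the speed of the IRanges version on large inputs.

```r
# The already-split data as a plain list (same content as w above)
w <- list("ABC", c("DEF", "ABC", "QWE"), "TRF", character(0), c("ABC", "GFD"))

# "Are element members equal to ABC": a list of logical vectors
eq <- lapply(w, function(x) x == "ABC")

# "Are any element members equal to ABC": one TRUE/FALSE per element;
# any(logical(0)) is FALSE, so the empty entry is handled without a special case
has_abc <- vapply(eq, any, logical(1))

# Indices rather than a logical vector
idx <- which(has_abc)
```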

This scales very well. For operations not supported "out of the box", a computationally cheap strategy is to unlist, transform the result into a vector of equal length, and then relist using the original CharacterList as a skeleton. For instance, to apply reverse to each member:

> relist(reverse(unlist(w)), w)
CharacterList of length 5
[[1]] CBA
[[2]] FED CBA EWQ
[[3]] FRT
[[4]] character(0)
[[5]] CBA DFG
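
The unlist, transform, regroup idea can also be sketched in base R on a plain list. The names `flat`, `grp`, `rev_flat` and `rev_w` are illustrative, and regrouping is done here with `split()` on a factor rather than with the CharacterList skeleton, so treat it as an analogy rather than the IRanges implementation.

```r
w <- list("ABC", c("DEF", "ABC", "QWE"), "TRF", character(0), c("ABC", "GFD"))

flat <- unlist(w)                       # one long character vector
grp  <- rep(seq_along(w), lengths(w))   # list element each piece came from

# Transform the flat vector once (here: reverse the characters of each string)
rev_flat <- vapply(strsplit(flat, "", fixed = TRUE),
                   function(ch) paste(rev(ch), collapse = ""),
                   character(1))

# Regroup; the explicit factor levels keep the empty fourth element in place
rev_w <- unname(split(rev_flat, factor(grp, levels = seq_along(w))))

# The same bookkeeping answers the original question without a per-element loop
has_abc <- seq_along(w) %in% grp[flat == "ABC"]
```

The transformation runs over a single vector, and only the cheap bookkeeping in `grp` is needed to map results back to the original positions.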

As @eddi points out, this is slower than grep. The motivation is (a) to avoid needing to formulate complicated regular expressions while (b) gaining flexibility for other operations one might like to do on data structured like this.

Martin Morgan answered Feb 12 '26 14:02


