Avoiding a loop on a strsplit list

Question

I have a vector v where each entry is one or more strings (or possibly character(0)) separated by semicolons:

ABC

DEF;ABC;QWE

TRF

character(0)

ABC;GFD

I need to find the indices of the vector which contain "ABC" (1,2,5 or a logical vector T,T,F,F,T) after splitting on ";"

I am currently using a loop as follows:

toSelect=integer(0)
for(i in c(1:length(v))){
  if(length(v[i])==0) next
  words=strsplit(v[i],";")[[1]]
  if(!is.na(match("ABC",words))) toSelect=c(toSelect,i)
}

Unfortunately, my vector has 450k entries, so this takes far too long. I would prefer to create a logical vector by doing something like

toSelect=!is.na(match("ABC",strsplit(v,";")))

But since strsplit returns a list, I can't find a way to properly format strsplit(v,";") as a vector (unlist won't do since it would ruin the indices). Does anybody have any ideas on how to speed up this code?

Thanks!

eddi · Accepted Answer

Use regular expressions:

v = list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
grep("(^|;)ABC($|;)", v)
#[1] 1 2 5
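
A minimal sketch of the same idea, assuming the logical vector mentioned in the question is wanted rather than the indices. grepl returns one TRUE/FALSE per element, and the (^|;) and ($|;) groups tie ABC to a whole semicolon-delimited token, so longer tokens such as ABCD should not match. grep and grepl coerce a list with as.character, so the character(0) element is seen as the literal string "character(0)" and simply fails to match. The variable names here are illustrative, not part of the original answer.

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")

# ABC must sit between the start of the string or a ";" and
# the end of the string or a ";".
pattern <- "(^|;)ABC($|;)"

# One logical value per element of v.
hits <- grepl(pattern, v)
hits
#[1]  TRUE  TRUE FALSE FALSE  TRUE

# The indices, if those are preferred.
idx <- which(hits)
idx
#[1] 1 2 5

# Tokens that merely contain ABC are not matched.
partial <- grepl(pattern, "XABC;ABCD")
partial
#[1] FALSE
```

If the token being searched for contains regex metacharacters, it would need escaping before being pasted into the pattern.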

Martin Morgan · Answer

The tricky part is dealing with character(0), which @BlueMagister fudges by replacing it with character(1) (this allows use of a vector, but doesn't allow representation of the original problem). Perhaps

v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
v[sapply(v, length) != 0] <- strsplit(unlist(v), ";", fixed=TRUE)

to do the string split. One might proceed in base R, but I'd recommend the IRanges package

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")

to install, then

library(IRanges)
w = CharacterList(v)

which gives a list-like structure where all elements must be character vectors.

> w
CharacterList of length 5
[[1]] ABC
[[2]] DEF ABC QWE
[[3]] TRF
[[4]] character(0)
[[5]] ABC GFD

One can then do fun things like ask "are element members equal to ABC"

> w == "ABC"
LogicalList of length 5
[[1]] TRUE
[[2]] FALSE TRUE FALSE
[[3]] FALSE
[[4]] logical(0)
[[5]] TRUE FALSE

or "are any element members equal to ABC"

> any(w == "ABC")
[1]  TRUE  TRUE FALSE FALSE  TRUE

This will scale very well. For operations not supported "out of the box", the computationally cheap strategy is to unlist, transform to an equal-length vector, then relist using the original CharacterList as a skeleton. For instance, to use reverse on each member:

> relist(reverse(unlist(w)), w)
CharacterList of length 5
[[1]] CBA
[[2]] FED CBA EWQ
[[3]] FRT
[[4]] character(0)
[[5]] CBA DFG

As @eddi points out, this is slower than grep. The motivation is (a) to avoid needing to formulate complicated regular expressions while (b) gaining flexibility for other operations one might like to do on data structured like this.
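
For anyone who would rather stay in base R, which the answer mentions as an option without spelling it out, here is a rough sketch of how it might look. It reuses the split from above and then shows two variants: a per-element test with vapply, and a flatten-compare-map-back version that mirrors the unlist/relist strategy. The names parts, owner, has_abc and so on are made up for this example, tolower stands in for an arbitrary vectorised transformation, and lengths() assumes R 3.2.0 or later.

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")

# Split only the non-empty elements; character(0) is left untouched.
parts <- v
parts[lengths(parts) != 0] <- strsplit(unlist(v), ";", fixed = TRUE)

# Variant 1: test each element separately (still an R-level loop
# internally, but no growing of a result vector).
has_abc <- vapply(parts, function(x) "ABC" %in% x, logical(1))
has_abc
#[1]  TRUE  TRUE FALSE FALSE  TRUE

# Variant 2: flatten once, compare once, then map hits back to the
# element each token came from.
flat  <- unlist(parts)
owner <- rep(seq_along(parts), lengths(parts))
has_abc_flat <- seq_along(parts) %in% owner[flat == "ABC"]

# The same bookkeeping regroups a transformed flat vector; the factor
# levels keep the empty element as character(0).
lowered <- unname(split(tolower(flat),
                        factor(owner, levels = seq_along(parts))))
```

Variant 2 does the comparison in a single vectorised call, which is the part that tends to matter for a few hundred thousand entries. No timings are claimed here, so it is worth benchmarking on the real data.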

Tags:

r

user2359686
