Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to index a vector sequence within a vector sequence

I have a solution to a problem that involves looping, and works, but I feel I am missing something that involves a more efficient implementation. The problem: I have a numeric vector sequence, and want to identify the starting position(s) in another vector of the first vector.

It works like this:

# helper function for matchSequence
# wraps a vector by removing the first n elements and padding end with NAs
wrapVector <- function(x, n) {
    stopifnot(n <= length(x))
    if (n == length(x)) 
        return(rep(NA, n))
    else
        return(c(x[(n+1):length(x)], rep(NA, n)))
}

wrapVector(LETTERS[1:5], 1)
## [1] "B" "C" "D" "E" NA
wrapVector(LETTERS[1:5], 2)
## [1] "C" "D" "E" NA  NA

# returns the starting index positions of the sequence found in a vector
matchSequence <- function(seq, vec) {
    matches <- seq[1] == vec
    if (length(seq) == 1) return(which(matches))
    for (i in 2:length(seq)) {
        matches <- cbind(matches, seq[i] == wrapVector(vec, i - 1))
    }
    which(rowSums(matches) == i)
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence(1:2, myVector)
## [1] 3 7
matchSequence(c(4, 1, 1), myVector)
## [1] 5
matchSequence(1:3, myVector)
## integer(0)

Is there a better way to implement matchSequence()?

Added

"Better" here can mean using more elegant methods I didn't think of, but even better, would mean faster. Try comparing solutions to:

set.seed(100)
myVector2 <- sample(c(NA, 1:4), size = 1000, replace = TRUE)
matchSequence(c(4, 1, 1), myVector2)
## [1]  12  48  91 120 252 491 499 590 697 771 865

microbenchmark::microbenchmark(matchSequence(c(4, 1, 1), myVector2))
## Unit: microseconds
##                                 expr     min       lq     mean   median       uq     max naval
## matchSequence(c(4, 1, 1), myVector2) 154.346 160.7335 174.4533 166.2635 176.5845 300.453   100
like image 615
Ken Benoit Avatar asked Oct 08 '15 23:10

Ken Benoit


People also ask

What are the four types of index for a vector?

Index vectors come in four different flavors – logical vectors, vectors of positive integers, vectors of negative integers, and vectors of character strings – each of which we'll cover in this lesson.

What does it mean to index a vector?

An important aspect of working with R objects is knowing how to “index” them Indexing means selecting a subset of the elements in order to use them in further analysis or possibly change them Here we focus just on three kinds of vector indexing: positional, named reference, and logical Any of these indexing techniques ...

Which method of indexing can be used to access vector elements?

Vector elements are accessed using indexing vectors, which can be numeric, character or logical vectors. You can access an individual element of a vector by its position (or "index"), indicated using square brackets. In R, the first element has an index of 1.

What is a sequence index?

Indexed sequencing is a method that allows multiple libraries to be pooled and sequenced together. Indexing libraries requires the addition of a unique identifier, or index sequence, to DNA samples during library preparation.


3 Answers

And a recursive idea (edit on Feb 5 '16 to work with NAs in pattern):

find_pat = function(pat, x) 
{
    ff = function(.pat, .x, acc = if(length(.pat)) seq_along(.x) else integer(0L)) {
        if(!length(.pat)) return(acc)

        if(is.na(.pat[[1L]])) 
            Recall(.pat[-1L], .x, acc[which(is.na(.x[acc]))] + 1L)
        else 
            Recall(.pat[-1L], .x, acc[which(.pat[[1L]] == .x[acc])] + 1L)
    }

    return(ff(pat, x) - length(pat))
}  

find_pat(1:2, myVector)
#[1] 3 7
find_pat(c(4, 1, 1), myVector)
#[1] 5
find_pat(1:3, myVector)
#integer(0)
find_pat(c(NA, 1), myVector)
#[1] 2
find_pat(c(3, NA), myVector)
#[1] 1

And on a benchmark:

all.equal(matchSequence(s, my_vec2), find_pat(s, my_vec2))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(s, my_vec2), 
                               flm(s, my_vec2), 
                               find_pat(s, my_vec2), 
                               unit = "relative")
#Unit: relative
#                      expr      min       lq   median       uq      max neval
# matchSequence(s, my_vec2) 2.970888 3.096573 3.068802 3.023167 12.41387   100
#           flm(s, my_vec2) 1.140777 1.173043 1.258394 1.280753 12.79848   100
#      find_pat(s, my_vec2) 1.000000 1.000000 1.000000 1.000000  1.00000   100

Using larger data:

set.seed(911); VEC = sample(c(NA, 1:3), 1e6, TRUE); PAT = c(3, 2, 2, 1, 3, 2, 2, 1, 1, 3)
all.equal(matchSequence(PAT, VEC), find_pat(PAT, VEC))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(PAT, VEC), 
                               flm(PAT, VEC), 
                               find_pat(PAT, VEC), 
                               unit = "relative", times = 20)
#Unit: relative
#                    expr       min       lq    median        uq       max neval
# matchSequence(PAT, VEC) 23.106862 20.54601 19.831344 18.677528 12.563634    20
#           flm(PAT, VEC)  2.810611  2.51955  2.963352  2.877195  1.728512    20
#      find_pat(PAT, VEC)  1.000000  1.00000  1.000000  1.000000  1.000000    20
like image 84
alexis_laz Avatar answered Nov 01 '22 05:11

alexis_laz


Here's a somewhat different idea:

f <- function(seq, vec) {
    mm <- t(embed(vec, length(seq))) == rev(seq)  ## relies on recycling of seq
    which(apply(mm, 2, all))
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)

f(1:2, myVector)
# [1] 3 7
f(c(4,1,1), myVector)
# [1] 5
f(1:3, myVector)
# integer(0)
like image 8
Josh O'Brien Avatar answered Nov 01 '22 05:11

Josh O'Brien


Another attempt which I believe is quicker again. This owes its speed to only checking for matches from points in the vector which match the start of the searched-for sequence.

flm <- function(sq, vec) {
  hits <- which(sq[1]==vec)
  out <- hits[
    colSums(outer(0:(length(sq)-1), hits, function(x,y) vec[x+y]) == sq)==length(sq)
  ]
  out[!is.na(out)]
}

Benchmark results:

#Unit: relative
#  expr      min       lq     mean   median       uq     max neval
# josh2 2.469769 2.393794 2.181521 2.353438 2.345911 1.51641   100
#    lm 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000   100
like image 6
thelatemail Avatar answered Nov 01 '22 05:11

thelatemail