How to index a vector sequence within a vector sequence

Tags:

I have a solution to a problem that involves looping, and works, but I feel I am missing something that involves a more efficient implementation. The problem: I have a numeric vector sequence, and want to identify the starting position(s) in another vector of the first vector.

It works like this:

# helper function for matchSequence
# wraps a vector by removing the first n elements and padding end with NAs
wrapVector <- function(x, n) {
    stopifnot(n <= length(x))
    if (n == length(x)) 
        return(rep(NA, n))
    else
        return(c(x[(n+1):length(x)], rep(NA, n)))
}

wrapVector(LETTERS[1:5], 1)
## [1] "B" "C" "D" "E" NA
wrapVector(LETTERS[1:5], 2)
## [1] "C" "D" "E" NA  NA

# returns the starting index positions of the sequence found in a vector
matchSequence <- function(seq, vec) {
    matches <- seq[1] == vec
    if (length(seq) == 1) return(which(matches))
    for (i in 2:length(seq)) {
        matches <- cbind(matches, seq[i] == wrapVector(vec, i - 1))
    }
    which(rowSums(matches) == i)
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence(1:2, myVector)
## [1] 3 7
matchSequence(c(4, 1, 1), myVector)
## [1] 5
matchSequence(1:3, myVector)
## integer(0)

Is there a better way to implement matchSequence()?

Added

"Better" here can mean using more elegant methods I didn't think of, but even better, would mean faster. Try comparing solutions to:

set.seed(100)
myVector2 <- sample(c(NA, 1:4), size = 1000, replace = TRUE)
matchSequence(c(4, 1, 1), myVector2)
## [1]  12  48  91 120 252 491 499 590 697 771 865

microbenchmark::microbenchmark(matchSequence(c(4, 1, 1), myVector2))
## Unit: microseconds
##                                 expr     min       lq     mean   median       uq     max naval
## matchSequence(c(4, 1, 1), myVector2) 154.346 160.7335 174.4533 166.2635 176.5845 300.453   100

615

asked Oct 08 '15 23:10

Ken Benoit

3 Answers

And a recursive idea (edit on Feb 5 '16 to work with NAs in pattern):

find_pat = function(pat, x) 
{
    ff = function(.pat, .x, acc = if(length(.pat)) seq_along(.x) else integer(0L)) {
        if(!length(.pat)) return(acc)

        if(is.na(.pat[[1L]])) 
            Recall(.pat[-1L], .x, acc[which(is.na(.x[acc]))] + 1L)
        else 
            Recall(.pat[-1L], .x, acc[which(.pat[[1L]] == .x[acc])] + 1L)
    }

    return(ff(pat, x) - length(pat))
}  

find_pat(1:2, myVector)
#[1] 3 7
find_pat(c(4, 1, 1), myVector)
#[1] 5
find_pat(1:3, myVector)
#integer(0)
find_pat(c(NA, 1), myVector)
#[1] 2
find_pat(c(3, NA), myVector)
#[1] 1

And on a benchmark:

all.equal(matchSequence(s, my_vec2), find_pat(s, my_vec2))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(s, my_vec2), 
                               flm(s, my_vec2), 
                               find_pat(s, my_vec2), 
                               unit = "relative")
#Unit: relative
#                      expr      min       lq   median       uq      max neval
# matchSequence(s, my_vec2) 2.970888 3.096573 3.068802 3.023167 12.41387   100
#           flm(s, my_vec2) 1.140777 1.173043 1.258394 1.280753 12.79848   100
#      find_pat(s, my_vec2) 1.000000 1.000000 1.000000 1.000000  1.00000   100

Using larger data:

set.seed(911); VEC = sample(c(NA, 1:3), 1e6, TRUE); PAT = c(3, 2, 2, 1, 3, 2, 2, 1, 1, 3)
all.equal(matchSequence(PAT, VEC), find_pat(PAT, VEC))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(PAT, VEC), 
                               flm(PAT, VEC), 
                               find_pat(PAT, VEC), 
                               unit = "relative", times = 20)
#Unit: relative
#                    expr       min       lq    median        uq       max neval
# matchSequence(PAT, VEC) 23.106862 20.54601 19.831344 18.677528 12.563634    20
#           flm(PAT, VEC)  2.810611  2.51955  2.963352  2.877195  1.728512    20
#      find_pat(PAT, VEC)  1.000000  1.00000  1.000000  1.000000  1.000000    20

answered Nov 01 '22 05:11

alexis_laz

Here's a somewhat different idea:

f <- function(seq, vec) {
    mm <- t(embed(vec, length(seq))) == rev(seq)  ## relies on recycling of seq
    which(apply(mm, 2, all))
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)

f(1:2, myVector)
# [1] 3 7
f(c(4,1,1), myVector)
# [1] 5
f(1:3, myVector)
# integer(0)

answered Nov 01 '22 05:11

Josh O'Brien

Another attempt which I believe is quicker again. This owes its speed to only checking for matches from points in the vector which match the start of the searched-for sequence.

flm <- function(sq, vec) {
  hits <- which(sq[1]==vec)
  out <- hits[
    colSums(outer(0:(length(sq)-1), hits, function(x,y) vec[x+y]) == sq)==length(sq)
  ]
  out[!is.na(out)]
}

Benchmark results:

#Unit: relative
#  expr      min       lq     mean   median       uq     max neval
# josh2 2.469769 2.393794 2.181521 2.353438 2.345911 1.51641   100
#    lm 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000   100

answered Nov 01 '22 05:11

thelatemail

Related questions
                            
                                Mutating column in `dplyr` using `rowSums`
                            
                                Why are my functions on lubridate dates so slow?
                            
                                How to swap (translate) values inside a vector
                            
                                Lagging Variables in R
                            
                                Calculating standard deviation of each row
                            
                                Convert row names in multiple data frames to column in data frame
                            
                                Make the border on one bar darker than the others
                            
                                Error when using predict() on a randomForest object trained with caret's train() using formula
                            
                                dplyr + ggplot2: Plotting not working via piping
                            
                                Centered X-axis label for muliplot using cowplot package
                            
                                number of unique elements excluding NA in r
                            
                                Unicode with knitr and Rmarkdown
                            
                                Shiny R - download the result of a table
                            
                                How to find the indices where there are n consecutive zeroes in a row
                            
                                Is there an R function for the element-wise summation of the matrices stored as elements in single list object? [duplicate]
                            
                                convert string date to R Date FAST for all dates
                            
                                How logical negation operator "!" works
                            
                                Power regression in R similar to excel
                            
                                using column names when appending data in write.table
                            
                                Remove NA from list of lists

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to index a vector sequence within a vector sequence

Tags:

performance

r

vector

Ken Benoit

People also ask

3 Answers

alexis_laz

Josh O'Brien

thelatemail

Recent Activity

Donate For Us