
Get runs of consecutive integers of certain length and sample from first values

Tags: r, vector, sequence

I am trying to create a function that will return the first integer of a subset of a vector such that the values of the subset are consecutive integers (each increasing by 1) and the subset has a specified length.

For example, using the input data 'v' and a specified length 'l' of 3:

v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

The possible sub-vectors of consecutive values of length 3 would be:

c(3, 4, 5)
c(4, 5, 6)
c(25, 26, 27)

Then I want to randomly choose one of these vectors and return the first/lowest number, i.e. 3, 4, or 25.

mallard asked Jun 12 '20 14:06





2 Answers

Here's an approach with base R:

First, we create all possible sub-vectors of the requested length (the length argument), one starting at each index. Next, we keep only those whose successive differences (diff) all equal 1. The is.na test ensures that the last vectors, which run past the end of the input and therefore contain NA, are also filtered out. Then we bind the remaining vectors into a matrix and sample from the first column.

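SampleSequencialVectors <- function(vec, length){
  # every window of `length` values starting at each index (NA-padded past the end)
  all.vecs <- lapply(seq_along(vec), function(x) vec[x:(x + (length - 1))])
  # keep windows whose successive differences are all exactly 1 and not NA
  seq.vec <- all.vecs[sapply(all.vecs, function(x) all(diff(x) == 1 & !is.na(diff(x))))]
  # the first column holds the candidate first values
  firsts <- do.call(rbind, seq.vec)[, 1]
  # index with sample.int(): sample(x, 1) draws from 1:x when x is a single number
  firsts[sample.int(length(firsts), 1)]
}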

replicate(10, SampleSequencialVectors(v, 3))
# [1]  3  4  3  3  4  4 25 25  3 25
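
To make the filtering step concrete, here is a minimal sketch of the intermediate values for the example input from the question. The names windows, is_run and starts are illustrative and not part of the answer's function; the filter is written with anyNA() instead of is.na(diff(x)), which should be equivalent here.

```r
v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

# One window of length l per starting index; windows that run past the
# end of v are padded with NA.
windows <- lapply(seq_along(v), function(i) v[i:(i + l - 1)])

# TRUE where a window is complete and its values increase by exactly 1.
is_run <- vapply(windows,
                 function(w) !anyNA(w) && all(diff(w) == 1),
                 logical(1))

# The first value of each surviving window is a candidate to sample from.
starts <- v[is_run]
starts
# [1]  3  4 25
```

Only the windows starting at 3, 4 and 25 survive; windows such as c(5, 6, 15) fail the diff test, and c(26, 27, NA) fails the NA test.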

Or if you prefer a tidyverse type approach:

library(magrittr)  # provides the %>% pipe

SampleSequencialVectorsPurrr <- function(vec, length){
  vec %>%
    seq_along %>%
    purrr::map(~vec[.x:(.x+(length-1))]) %>%
    purrr::keep(~ all(diff(.x) == 1 & !is.na(diff(.x)))) %>%
    purrr::invoke(rbind,.) %>%
    # pick a random row; sample(x, 1) would draw from 1:x if only one value is left
    {.[sample.int(nrow(.), 1), 1]}
}
replicate(10, SampleSequencialVectorsPurrr(v, 3))
# [1]  4 25 25  3 25  4  4  3  4 25
Ian Campbell answered Oct 06 '22 01:10



  1. Split the vector into runs of consecutive values*: split(v, cumsum(c(1L, diff(v) != 1)))
  2. Select the runs whose length is at least the required length lim: runs[lengths(runs) >= lim]
  3. From each run, select the possible first values: x[1:(length(x) - lim + 1)]
  4. From all possible first values, sample 1.

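    lim = 3  # required run length (called l in the question)

    runs = split(v, cumsum(c(1L, diff(v) != 1)))

    first = unlist(lapply(runs[lengths(runs) >= lim], function(x) x[1:(length(x) - lim + 1)]))

    # index with sample.int(): sample(x, 1) would draw from 1:x if only one value is left
    first[sample.int(length(first), 1)]
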

Here we loop over runs of sufficient length, and not all individual values (see the other answers), thus it may be faster on larger vectors (haven't tested).
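
The steps above can also be wrapped in reusable functions. The following is only a sketch: the names run_starts and sample_run_start are hypothetical, and stopping with an error when no run is long enough is an assumption about the desired behaviour, not something the question specifies.

```r
# Hypothetical helper: all values that can start a window of `lim`
# consecutive integers in `v`.
run_starts <- function(v, lim) {
  # A new group id begins wherever the step between neighbours is not 1.
  runs <- split(v, cumsum(c(1L, diff(v) != 1)))
  # Only runs with at least `lim` values can contain a full window.
  runs <- runs[lengths(runs) >= lim]
  # In a run of n values, the first n - lim + 1 values can start a window.
  unlist(lapply(runs, function(x) x[seq_len(length(x) - lim + 1)]),
         use.names = FALSE)
}

# Hypothetical helper: draw one starting value at random.
sample_run_start <- function(v, lim) {
  starts <- run_starts(v, lim)
  # Assumed behaviour: fail loudly when there is nothing to sample from.
  if (length(starts) == 0L) stop("no run of at least `lim` consecutive values")
  # Sample an index, so a single candidate is returned unchanged.
  starts[sample.int(length(starts), 1)]
}

v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
run_starts(v, 3)
# [1]  3  4 25
sample_run_start(v, 3)  # one of 3, 4, 25
```

Separating the candidate set from the random draw makes the deterministic part easy to check on its own.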


Slightly more compact using data.table:

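 library(data.table)
 # sample a row index via .N, so a single candidate value is returned as-is
 data.table(v)[ , if(.N >= lim) v[1:(length(v) - lim + 1)],
                by = .(cumsum(c(1L, diff(v) != 1)))][sample(.N, 1), V1]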

*Credits to the nice canonical: How to split a vector into groups of consecutive sequences?.

Henrik answered Oct 05 '22 23:10
