
In R, split a vector randomly into k chunks?

I have seen many variations on the "split vector X into Y chunks in R" question on here. See for example: here and here for just two. So, when I realized I needed to split a vector into Y chunks of random size, I was surprised to find that the randomness requirement might be "new"--I couldn't find a way to do this on here.

So, here's what I've drawn up:

library(magrittr) # needed for the %>% pipe used below

k.chunks = function(seq.size, n.chunks) {
  # Get a set of break points chosen from along the length of the vector,
  # sampled without replacement so there are no duplicate selections.
  break.pts = sample(1:seq.size, n.chunks, replace = FALSE) %>% sort()
  groups = rep(NA, seq.size) # Set up the empty output vector.
  groups[1:break.pts[1]] = 1 # Set the first group's affiliations; it has a unique start point of 1.

  for (i in 2:(n.chunks)) { # For all other chunks...
    groups[break.pts[i-1]:break.pts[i]] = i # Set the respective group affiliations.
  }
  groups[break.pts[n.chunks]:seq.size] = n.chunks # Set the last group; it has a unique endpoint of seq.size.
  return(groups)
}

My question is: is this inelegant or inefficient somehow? It will get called thousands of times in the code I plan to run, so efficiency is important to me. It'd be especially nice to avoid the for loop, and to avoid having to set both the first and last groups "manually." My other question: are there logical inputs that could break this? I recognize that n.chunks cannot exceed seq.size, so I mean other than that.
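For example (a check of my own, not something I've confirmed exhaustively), n.chunks = 1 seems to be one such breaking input, because 2:(n.chunks) becomes 2:1, which counts downward in R rather than producing an empty sequence, so break.pts[2] is NA inside the loop:

# Sketch of a breaking input: a single chunk.
# 2:(n.chunks) evaluates to 2:1, the loop still runs, and break.pts[2] is NA,
# which should throw an "NA/NaN argument"-style error from `:`.
k.chunks(10, 1)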

asked Apr 22 '26 by Bajcz
1 Answer

That should be pretty quick for smaller numbers. But here is a more concise way.

k.chunks2 = function(seq.size, n.chunks) {
  # n.chunks - 1 unique break points partition 1:seq.size into n.chunks pieces.
  break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
  # Differences between consecutive break points (padded with 0 and seq.size)
  # give the length of each chunk.
  break.len <- diff(c(0, break.pts, seq.size))
  
  # Expand the group labels to those lengths.
  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}

If you really have a huge number of groups, I think the sort will start to cost you execution time. So you can do something like this (which can probably be tweaked to be even faster) to split based on random proportions instead. I am not sure how I feel about this approach, because as n.chunks gets very large the proportions will get very small. But it is faster.

k.chunks3 = function(seq.size, n.chunks) {
  # Draw random proportions and normalize them to sum to 1.
  props <- runif(n.chunks)
  grp.props <- props / sum(props)
  
  # Convert the first n.chunks - 1 proportions to integer chunk sizes;
  # the last chunk takes whatever remains so the sizes sum to seq.size.
  chunk.size <- floor(grp.props[-n.chunks] * seq.size)
  break.len <- c(chunk.size, seq.size - sum(chunk.size))
  
  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}
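One quick way to see the effect of those small proportions (a check of my own, not part of the benchmark below): with floor(), any normalized proportion below 1/seq.size rounds down to a zero-length chunk, so some group labels can be missing entirely from the output. Something like this makes that easy to spot:

# Sanity-check sketch: count how many of the n.chunks labels actually appear
# when the average chunk is tiny. The exact count depends on the seed.
set.seed(1)
g <- k.chunks3(seq.size = 1000, n.chunks = 800)
length(unique(g))   # likely well under 800, since many chunk sizes floor to 0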

Running a benchmark on small inputs, I think any of these will be fast enough (the unit here is microseconds).

n <- 1000
y <- 10

microbenchmark::microbenchmark(k.chunks(n, y),
                               k.chunks2(n, y),
                               k.chunks3(n, y))

Unit: microseconds
            expr  min    lq   mean median    uq   max neval
  k.chunks(n, y) 49.9 52.05 59.613  53.45 58.35 251.7   100
 k.chunks2(n, y) 46.1 47.75 51.617  49.25 52.55 107.1   100
 k.chunks3(n, y)  8.1  9.35 11.412  10.80 11.75  44.2   100

But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).

n <- 1000000
y <- 100000

microbenchmark::microbenchmark(k.chunks(n, y),
                               k.chunks2(n, y),
                               k.chunks3(n, y))

Unit: milliseconds
            expr     min       lq     mean   median       uq      max neval
  k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038   100
 k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510  69.5639   100
 k.chunks3(n, y)  7.7354  8.62715 10.32754  9.07045 10.44675  58.2093   100

All said and done, I would probably use my k.chunks2() function.
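If you then want the actual chunks rather than just the group labels, split() does the final step (a usage sketch on my part; x and the chunk count here are just placeholders):

# Example usage: turn the group labels into an actual list of chunks.
x <- letters[1:10]                  # placeholder vector to split
labels <- k.chunks2(length(x), 3)   # one label per element
chunks <- split(x, labels)          # list of (up to) 3 randomly sized chunks
chunks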