I have seen many variations on the "split vector X into Y chunks in R" question on here. See for example: here and here for just two. So, when I realized I needed to split a vector into Y chunks of random size, I was surprised to find that the randomness requirement might be "new"--I couldn't find a way to do this on here.
So, here's what I've drawn up:
k.chunks = function(seq.size, n.chunks) {
break.pts = sample(1:seq.size, n.chunks, replace=F) %>% sort() #Get a set of break points chosen from along the length of the vector without replacement so no duplicate selections.
groups = rep(NA, seq.size) #Set up the empty output vector.
groups[1:break.pts[1]] = 1 #Set the first set of group affiliations because it has a unique start point of 1.
for (i in 2:(n.chunks)) { #For all other chunks...
groups[break.pts[i-1]:break.pts[i]] = i #Set the respective group affiliations
}
groups[break.pts[n.chunks]:seq.size] = n.chunks #Set the last group affiliation because it has a unique endpoint of seq.size.
return(groups)
}
My question is: Is this inelegant or inefficient somehow? It will get called 1000s of times in the code I plan to do, so efficiency is important to me. It'd be especially nice to avoid the for loop or having to set both the first and last groups "manually." My other question: Are there logical inputs that could break this? I recognize that n.chunks cannot > seq.size, so I mean other than that.
That should be pretty quick for smaller numbers. But here a more concise way.
k.chunks2 = function(seq.size, n.chunks) {
break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
break.len <- diff(c(0, break.pts, seq.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
If you really get a huge number of groups, I think the sort will start to cost you execution time. So you can do something like this (probably can be tweaked to be even faster) to split based on proportions. I am not sure how I feel about this, because as n.chunks gets very large, the proportions will get very small. But it is faster.
k.chunks3 = function(seq.size, n.chunks) {
props <- runif(n.chunks)
grp.props <- props / sum(props)
chunk.size <- floor(grp.props[-n.chunks] * seq.size)
break.len <- c(chunk.size, seq.size - sum(chunk.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
Running a benchmark, I think any of these will be fast enough (unit is microseconds).
n <- 1000
y <- 10
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: microseconds
expr min lq mean median uq max neval
k.chunks(n, y) 49.9 52.05 59.613 53.45 58.35 251.7 100
k.chunks2(n, y) 46.1 47.75 51.617 49.25 52.55 107.1 100
k.chunks3(n, y) 8.1 9.35 11.412 10.80 11.75 44.2 100
But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).
n <- 1000000
y <- 100000
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: milliseconds
expr min lq mean median uq max neval
k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038 100
k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510 69.5639 100
k.chunks3(n, y) 7.7354 8.62715 10.32754 9.07045 10.44675 58.2093 100
All said and done, I would probably use my k.chunks2() function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With