My question is closely related to this one:
Split a vector into chunks in R
I'm trying to split a large vector into chunks of a known size, and it's slow. A solution for vectors that split evenly (no remainder) is here:
A quick solution when a factor exists is here:
Split dataframe into equal parts based on length of the dataframe
I would like to handle the case where no (large) factor exists, since I want fairly large chunks.
Here is an example with a vector much smaller than the one in my real-life application:
d <- 1:6510321
# Sloooow: split() first coerces the grouping vector to a factor
chunks <- split(d, ceiling(seq_along(d)/2000))
Using llply from the plyr package I was able to reduce the time.
chunks <- function(d, n) {
  # Reference implementation: split() on a computed grouping vector
  chunks <- split(d, ceiling(seq_along(d)/n))
  names(chunks) <- NULL
  return(chunks)
}
require(plyr)
plyrChunks <- function(d, n) {
  # Chunk start positions: 1, n+1, 2n+1, ...
  is <- seq(from = 1, to = length(d), by = ceiling(n))
  if (tail(is, 1) != length(d)) {
    is <- c(is, length(d))
  }
  # Each chunk runs from one start position up to just before the next
  chunks <- llply(head(seq_along(is), -1),
                  function(i) {
                    start <- is[i]
                    end <- is[i + 1] - 1
                    d[start:end]
                  })
  # The loop above stops one short of the last element, so append it
  lc <- length(chunks)
  td <- tail(d, 1)
  chunks[[lc]] <- c(chunks[[lc]], td)
  return(chunks)
}
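As an aside, the speed-up comes mainly from indexing by position rather than having split() build a factor over millions of elements, so the same idea works in base R without the plyr dependency. A minimal sketch (baseChunks is just an illustrative name):
baseChunks <- function(d, n) {
  # First index of each chunk: 1, n+1, 2n+1, ...
  starts <- seq(1, length(d), by = n)
  # min() keeps the final (possibly shorter) chunk inside the vector
  lapply(starts, function(i) d[i:min(i + n - 1, length(d))])
}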
# testing
d <- 1:6510321
n <- 2000
system.time(chks <- chunks(d,n))
# user system elapsed
# 5.472 0.000 5.472
system.time(plyrChks <- plyrChunks(d, n))
# user system elapsed
# 0.068 0.000 0.065
identical(chks, plyrChks)
# TRUE
You can speed this up even more using the .parallel parameter of llply, or add a progress bar with the .progress parameter.
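For example, a minimal sketch that adds .parallel to the llply call from plyrChunks (this assumes the doParallel package as the foreach backend; any registered backend would do):
library(plyr)
library(doParallel)
registerDoParallel(cores = 2)  # register workers so .parallel = TRUE has a backend
is <- seq(from = 1, to = length(d), by = n)
if (tail(is, 1) != length(d)) is <- c(is, length(d))
chunks <- llply(head(seq_along(is), -1),
                function(i) d[is[i]:(is[i + 1] - 1)],
                .parallel = TRUE)  # or .progress = "text" for a progress bar
# as in plyrChunks, the last element still has to be appended:
chunks[[length(chunks)]] <- c(chunks[[length(chunks)]], tail(d, 1))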
A further speed improvement uses splitIndices() from the parallel package:
chunks <- parallel::splitIndices(6510321, ncl = ceiling(6510321/2000))
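Note that splitIndices() returns a list of index vectors rather than the values themselves; to get value chunks, subset with them. A small sketch using the same d and chunk size as above (the pieces are balanced, so each holds roughly 2000 elements):
idx <- parallel::splitIndices(length(d), ncl = ceiling(length(d)/2000))
chunks <- lapply(idx, function(i) d[i])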