I have about 300 files, each containing 1000 time series realisations (~76 MB each file).
I want to calculate the quantiles (0.05, 0.50, 0.95) at each time step from the full set of 300000 realisations.
I cannot merge the realisations into one file because it would become too large.
What's the most efficient way of doing this?
Each matrix is generated by running a model; here is a sample containing random numbers instead:
x <- matrix(rexp(10000000, rate=.1), nrow=1000)
There are at least three options:
Edit: Example of (3).
Note that I am not a champion algorithm designer, and someone has almost certainly designed a better algorithm for this. This implementation is also not particularly efficient: if speed matters to you, consider Rcpp, or even just more optimized R. Building a bunch of lists and then extracting values from them is not so smart, but it was easy to prototype this way, so I went with it.
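The prototype below sets `guessedrange` by hand. As a rough sketch of the "sampling" idea, one could take a small pilot sample from every group, bracket the target with pilot quantiles on either side, and then confirm the bracket using counts alone. The data layout, the pilot size of 10 and the `margin` of 0.05 are assumptions made for illustration, not part of the original answer.

```r
set.seed(1)

# Hypothetical stand-in for the files: 100 groups of 100 values each
groups <- split(runif(10000), rep(seq_len(100), each = 100))

desiredQuantile <- 0.25
margin <- 0.05  # assumed safety margin on the probability scale

# Pilot sample: a few values drawn from every group
pilot <- unlist(lapply(groups, function(v) sample(v, 10)), use.names = FALSE)

# Bracket the target with pilot quantiles either side of it
probs <- c(max(0, desiredQuantile - margin), min(1, desiredQuantile + margin))
guessedrange <- unname(quantile(pilot, probs))

# Verify the bracket against the full data using per-group counts only
n <- sum(lengths(groups))
belowCount <- sum(sapply(groups, function(v) sum(v < guessedrange[1])))
aboveCount <- sum(sapply(groups, function(v) sum(v > guessedrange[2])))

# Ranks of the sorted values that the final lookup may need
lowRank  <- floor(desiredQuantile * n)
highRank <- lowRank + 1
bracketOK <- belowCount < lowRank && highRank <= n - aboveCount
bracketOK
```

If `bracketOK` comes back `FALSE`, the range missed the target and would need widening (a larger `margin`) before the within-range values are collected.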
library(plyr)
set.seed(1)
# -- Configuration -- #
desiredQuantile <- .25
# -- Generate sample data -- #
# Use some algorithm (sampling, iteration, or something else) to come up with a range you're sure the true value lies within
guessedrange <- c( .2, .3 )
# Group the observations to correspond to the OP's files
dat <- data.frame( group = rep( seq(100), each=100 ), value = runif(10000) )
# -- Apply the algorithm -- #
# Count the number above/below and return the values within the range, by group
res <- dlply( dat, .( group ), function( x, guessedrange ) {
  above <- x$value > guessedrange[2]
  below <- x$value < guessedrange[1]
  list(
    aboveCount = sum( above ),
    belowCount = sum( below ),
    withinValues = x$value[ !above & !below ]
  )
}, guessedrange = guessedrange )
# Extract the count of values below and the values within the range
belowCount <- sum( sapply( res, function(x) x$belowCount ) )
belowCount
withinValues <- unlist( lapply( res, function(x) x$withinValues ), use.names = FALSE )
str(withinValues)
# Count up until we find the within value we want
desiredQuantileCount <- floor( desiredQuantile * nrow(dat) ) #! Should fix this so it averages when there's a tie
sort(withinValues)[ desiredQuantileCount - belowCount + 1 ]
# Compare to exact value
quantile( dat$value, desiredQuantile )
In the end, the value is a little off from the exact version. I suspect I'm shifted over by one, or there is some equally silly explanation, but maybe I'm missing something fundamental.
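One likely explanation for the discrepancy: `quantile()` defaults to `type = 7`, which interpolates linearly between two neighbouring sorted values at the fractional rank h = (n - 1) * p + 1. With n = 10000 and p = 0.25 that gives h = 2500.75, so the exact answer lies between the 2500th and 2501st sorted values, whereas the prototype returns the 2501st sorted value outright. The sketch below applies the same interpolation to the below-count and within-range values; `quantileFromCounts` is a hypothetical helper name, and it assumes the guessed range contains both ranks.

```r
# Hypothetical helper: reproduce quantile(x, p, type = 7) from the count of
# values below the guessed range and the values inside it.
quantileFromCounts <- function(withinValues, belowCount, n, p) {
  h  <- (n - 1) * p + 1   # fractional rank used by type 7
  lo <- floor(h)
  hi <- ceiling(h)
  # Both ranks must fall inside the guessed range
  stopifnot(lo > belowCount, hi <= belowCount + length(withinValues))
  s   <- sort(withinValues)
  xlo <- s[lo - belowCount]
  xhi <- s[hi - belowCount]
  # Linear interpolation between the two neighbouring order statistics
  xlo + (h - lo) * (xhi - xlo)
}

set.seed(1)
value <- runif(10000)
guessedrange <- c(0.2, 0.3)

belowCount   <- sum(value < guessedrange[1])
withinValues <- value[value >= guessedrange[1] & value <= guessedrange[2]]

fromCounts <- quantileFromCounts(withinValues, belowCount, length(value), 0.25)
exact      <- unname(quantile(value, 0.25))
all.equal(fromCounts, exact)
```

The two results should agree up to floating-point tolerance. For the question's setting, `belowCount` and `withinValues` would be accumulated file by file rather than computed from one vector.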