Let's say I have a set of numbers that I suspect come from the same distribution.
set.seed(20130613)
x <- rcauchy(10)
I would like a function that randomly generates a number from that same unknown distribution. One approach I have thought of is to create a density
object and then get the CDF from that and take the inverse CDF of a random uniform variable (see Wikipedia).
den <- density(x)
#' Generate n random numbers from density() object
#'
#' @param n The total random numbers to generate
#' @param den The density object from which to generate random numbers
rden <- function(n, den)
{
diffs <- diff(den$x)
# Making sure we have equal increments
stopifnot(all(abs(diff(den$x) - mean(diff(den$x))) < 1e-9))
total <- sum(den$y)
den$y <- den$y / total
ydistr <- cumsum(den$y)
yunif <- runif(n)
indices <- sapply(yunif, function(y) min(which(ydistr > y)))
x <- den$x[indices]
return(x)
}
rden(1, den)
## [1] -0.1854121
My questions are the following:
sample
)?To generate data from a density estimate you just randomly choose one of the original data points and add a random "error" piece based on the kernel from the density estimate, for the default of "Gaussian" this just means choose a random element from the original vector and add a random normal with mean 0 and sd equal to the bandwidth used:
den <- density(x)
N <- 1000
newx <- sample(x, N, replace=TRUE) + rnorm(N, 0, den$bw)
Another option is to fit a density using the logspline
function from the logspline
package (uses a different method of estimating a density), then use the rlogspline
function in that package to generate new data from the estimated density.
If all you need is to draw values from your existing pool of numbers, then sample
is the way to go.
If you want to draw from the presumed underlying distribution, then use density
, and fit that to your presumed distribution to get the necessary coefficients (mean, sd, etc.), and use the appropriate R
distribution function.
Beyond that, I'd take a look at Chapter7.3 ("rejection method") of Numerical Recipes in C for ways to "selectively" sample according to any distribution. The code is simple enough to be easily translated into R
.
My bet is someone already has done so and will post a better answer than this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With