Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - random distribution with predefined min, max, mean, and sd values

I want to generate a random distribution of say 10,000 numbers with predefined min, max, mean, and sd values. I have followed this link setting upper and lower limits in rnorm to get random distribution with fixed min and max values. However, in doing so, mean value changes.

For example,

#Function to generate values between a lower limit and an upper limit.
mysamp <- function(n, m, s, lwr, upr, nnorm) {
set.seed(1)
samp <- rnorm(nnorm, m, s)
samp <- samp[samp >= lwr & samp <= upr]
if (length(samp) >= n) {
return(sample(samp, n))
}  
stop(simpleError("Not enough values to sample from. Try increasing nnorm."))
} 
Account_Value <- mysamp(n=10000, m=1250000, s=4500000, lwr=50000, upr=5000000, nnorm=1000000)
summary(Account_Value)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 50060 1231000 2334000 2410000 3582000 5000000
#Note - though min and max values are good, mean value is very skewed for an obvious reason.
# sd(Account_Value) # 1397349

I am not sure whether we can generate a random normal distribution that meets all conditions. If there is any other sort of random distribution that can meet all conditions, please do share too.

Look forward to your inputs.

-Thank you.

like image 777
skumar Avatar asked Jun 12 '16 11:06

skumar


People also ask

How do I create a random distribution in R?

Random numbers from a normal distribution can be generated using rnorm() function. We need to specify the number of samples to be generated. We can also specify the mean and standard deviation of the distribution. If not provided, the distribution defaults to 0 mean and 1 standard deviation.

What does Rnorm do in R?

rnorm is the R function that simulates random variates having a specified normal distribution. As with pnorm , qnorm , and dnorm , optional arguments specify the mean and standard deviation of the distribution.


2 Answers

You could use a generalized form of the beta distribution, known as the Pearson type I distribution. The standard beta distribution is defined on the interval (0,1), but you can take a linear transformation of a standard beta distributed variable to obtain values between any (min, max). The answer to this question on CrossValidated explains how to parameterize a beta distribution with its mean and variance, with certain constraints.

While it's possible to formulate both a truncated normal and a generalized beta distribution with the desired min, max, mean and sd, the shape of the two distributions will be very different. This is because the truncated normal distribution has a positive probability density at the endpoints of its support interval, while in the generalized beta distribution the density will always fall smoothly to zero at the endpoints. Which shape is more preferable will depend on your intended application.

Here's an implementation in R for generating generalized beta distributed observations with a mean, variance, min and max parameterization.

rgbeta <- function(n, mean, var, min = 0, max = 1)
{
  dmin <- mean - min
  dmax <- max - mean

  if (dmin <= 0 || dmax <= 0)
  {
    stop(paste("mean must be between min =", min, "and max =", max)) 
  }

  if (var >= dmin * dmax)
  {
    stop(paste("var must be less than (mean - min) * (max - mean) =", dmin * dmax))
  }

  # mean and variance of the standard beta distributed variable
  mx <- (mean - min) / (max - min)
  vx <- var / (max - min)^2

  # find the corresponding alpha-beta parameterization
  a <- ((1 - mx) / vx - 1 / mx) * mx^2
  b <- a * (1 / mx - 1)

  # generate standard beta observations and transform
  x <- rbeta(n, a, b)
  y <- (max - min) * x + min

  return(y)
}

set.seed(1)

n <- 10000
y <- rgbeta(n, mean = 1, var = 4, min = -4, max = 5)

sapply(list(mean, sd, min, max), function(f) f(y))
#    [1]  0.9921269  2.0154131 -3.8653859  4.9838290
like image 140
Mikko Marttila Avatar answered Oct 25 '22 11:10

Mikko Marttila


Discussion:

Hi. It is very interesting problem. It needs quite an effort to be solved properly and not always solution can be found.

First thing is that when you truncate a distribution (set a min and max for it) standard deviation is limited (has a maximum depending on min and max values). If you want too big value of it - you can not get it.

Second restriction limits mean. It is obvious that if you want mean below minimum and above maximum it will not work, but you may want something too close to limits and still it can not be satisfied.

Third restriction limits a combination of this parameters. Im not sure how does it work, but i am pretty sure not all the combinations may be satisfied.

But there are some combinations that may work and may be found.

Solution:

The problem is: what are the parameters: mean and sd of truncated (cut) distribution with defined limits a and b, so in the end the mean will be equal to desired_mean and standard deviation will be equal to desired_sd.

It is important that values of parameters: mean and sd are used before truncation. So that is why in the end mean and deviation are diffrent.

Below is the code that solves the problem using function optim(). It may not be the best solution for this problem, but it generally works:

require(truncnorm)

eval_function <- function(mean_sd){
    mean <- mean_sd[1]
    sd <- mean_sd[2]
    sample <- rtruncnorm(n = n, a = a, b = b, mean = mean, sd = sd)
    mean_diff <-abs((desired_mean - mean(sample))/desired_mean)
    sd_diff <- abs((desired_sd - sd(sample))/desired_sd)
    mean_diff + sd_diff
}

n = 1000
a <- 1
b <- 6
desired_mean <- 3
desired_sd <- 1

set.seed(1)
o <- optim(c(desired_mean, desired_sd), eval_function)

new_n <- 10000
your_sample <- rtruncnorm(n = new_n, a = a, b = b, mean = o$par[1], sd = o$par[2])
mean(your_sample)
sd(your_sample)
min(your_sample)
max(your_sample)
eval_function(c(o$par[1], o$par[2]))

I am very interested if there are other solutions to that problem, so please post them if you find other answers.

EDIT:

@Mikko Marttila: Thanks to your comment and link: Wikipedia I implemented formulas to calculate mean and sd of truncated distribution. Now the solution is WAY more elegant and it should calculate quite accurately mean and sd of the desired distribution if they exist. It works much faster also.

I implemented eval_function2 which should be used in the optim() function instead of previous one:

eval_function2 <- function(mean_sd){
    mean <- mean_sd[1]
    sd <- mean_sd[2]

    alpha <- (a - mean)/sd
    betta <- (b - mean)/sd

    trunc_mean <- mean + sd * (dnorm(alpha, 0, 1) - dnorm(betta, 0, 1)) / 
                  (pnorm(betta, 0, 1) - pnorm(alpha, 0, 1))

    trunc_var <- (sd ^ 2) * 
                 (1 + 
                  (alpha * dnorm(alpha, 0, 1) - betta * dnorm(betta, 0, 1))/
                  (pnorm(betta, 0, 1) - pnorm(alpha, 0, 1)) -
                 (dnorm(alpha, 0, 1) - dnorm(betta, 0, 1))/
                 (pnorm(betta, 0, 1) - pnorm(alpha, 0, 1)))

    trunc_sd <- trunc_var ^ 0.5

    mean_diff <-abs((desired_mean - trunc_mean)/desired_mean)
    sd_diff <- abs((desired_sd - trunc_sd)/desired_sd)
}
like image 29
cure Avatar answered Oct 25 '22 12:10

cure