Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating samples from a two-Gaussian mixture in r (code given in MATLAB)

I'm trying to create (in r) the equivalent to the following MATLAB function that will generate n samples from a mixture of N(m1,(s1)^2) and N(m2, (s2)^2) with a fraction, alpha, from the first Gaussian.

I have a start, but the results are notably different between MATLAB and R (i.e., the MATLAB results give occasional values of +-8 but the R version never even gives a value of +-5). Please help me sort out what is wrong here. Thanks :-)

For Example: Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.

MATLAB

function

function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha)
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);

implementation

P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])

resulting plot

plot of randomly generated samples from two Gaussian distributions in MATLAB

resulting hist

histogram of randomly generated samples from two Gaussian distributions in MATLAB

R

yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))

resulting plot

plot of randomly generated samples from two Gaussian distributions in R

resulting hist

histogram of randomly generated samples from two Gaussian distributions in R

As always, THANK YOU!

SOLUTION

gaussmix <- function(nsim,mean_1,mean_2,std_1,std_2,alpha){
   U <- runif(nsim)
   I <- as.numeric(U<alpha)
   y <- I*rnorm(nsim,mean=mean_1,sd=std_1)+
       (1-I)*rnorm(nsim,mean=mean_2,sd=std_2)
   return(y)
}

z1 <- gaussmix(1000,0,0,1,6,0.95)
z1_standardized <- (z1-mean(z1))/sqrt(var(z1))
z2 <- gaussmix(1000,0,3,1,1,0.80)
z2_standardized <- (z2-mean(z2))/sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3-mean(z3))/sqrt(var(z3))

par(mfrow=c(2,3))
hist(z1_standardized,xlim=c(-10,10),ylim=c(0,500),
   main="Histogram of 95% of N(0,1) and 5% of N(0,36)",
   col="blue",xlab=" ")
hist(z2_standardized,xlim=c(-10,10),ylim=c(0,500),
   main="Histogram of 80% of N(0,1) and 10% of N(3,1)",
   col="blue",xlab=" ")
hist(z3_standardized,xlim=c(-10,10),ylim=c(0,500),
   main="Histogram of samples of LN(0,1)",col="blue",xlab=" ")
##
plot(z1_standardized,type='l',
   main="1000 samples from a mixture N(0,1) and N(0,36)",
   col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z2_standardized,type='l',
   main="1000 samples from a mixture N(0,1) and N(3,1)",
   col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z3_standardized,type='l',
  main="1000 samples from LN(0,1)",
   col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
like image 248
ruya Avatar asked Sep 16 '12 19:09

ruya


People also ask

How do you sample from the Gaussian mixture model?

To sample a point from the GMM, first choose a mixture component by drawing j from the categorical distribution with probabilities [π1,…,πd]. This can be done using a random number generator for the categorical distribution.

How can a Gaussian mixture model be used for clustering?

Gaussian mixture models (GMMs) are often used for data clustering. You can use GMMs to perform either hard clustering or soft clustering on query data. To perform hard clustering, the GMM assigns query data points to the multivariate normal components that maximize the component posterior probability, given the data.


2 Answers

There are two problems, I think ... (1) your R code is creating a mixture of normal distributions with standard deviations of 1 and 37. (2) By setting prob equal to alpha in your rbinom() call, you're getting a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of Gaussian with sd 1, rather than a Gaussian with sd 1 that is contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (which is about 36.6) basically reduces it to a standard Gaussian with a slight bump near the origin ...

(The other answers posted here do solve your problem perfectly well, but I thought you might be interested in a diagnosis ...)

A more compact (and perhaps more idiomatic) version of your Matlab gaussmix function (I think runif(n)<alpha is slightly more efficient than rbinom(n,size=1,prob=alpha) )

gaussmix <- function(n,m1,m2,s1,s2,alpha) {
    I <- runif(n)<alpha
    rnorm(n,mean=ifelse(I,m1,m2),sd=ifelse(I,s1,s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
like image 67
Ben Bolker Avatar answered Sep 30 '22 12:09

Ben Bolker


Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and diverse covariance structures. See ?mclust::sim. The example task would be done this way:

require(mclust)
simdata = sim(modelName = "V",
              parameters = list(pro = c(0.95, 0.05),
                                mean = c(0, 0),
                                variance = list(modelName = "V", 
                                                d = 1, 
                                                G = 2,
                                                sigmasq = c(0, 36))),
              n = 1000)
plot(scale(simdata[,2]), type = "h")
like image 33
Zoë Clark Avatar answered Sep 30 '22 12:09

Zoë Clark