
R: How to fit a large dataset with a combination of distributions?

To fit a dataset of real-valued numbers (x) with a single distribution, we can use fitdistr() from the MASS package with, for example, either the gamma or Student's t distribution:

fitdistr(x, "gamma")

or

fitdistr(x, "t")

What if I believe my dataset should be fit by the sum of a gamma and a t distribution?

P(X) = Gamma(x) + t(x)

Can I fit the parameters of mixtures of probability distributions using Maximum Likelihood fitting in R?

user2718 asked Jun 26 '11 17:06



2 Answers

There are analytic maximum-likelihood estimators for some parameters, such as the mean of a normal distribution or the rate of an exponential distribution. For other parameters, there is no analytic estimator, but you can use numerical optimization to find reasonable parameter estimates.

For distributions without a closed-form estimator, the fitdistr() function in R numerically optimizes the log-likelihood function by calling the optim() function. If you think that your data is a mixture of a gamma and a t distribution, then simply write a likelihood function that describes such a mixture. Then pass that function, together with initial parameter values, to optim() for optimization. Here is an example that uses this approach to fit a single normal distribution:

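library( MASS )  # provides fitdistr(); optim() itself is in base R

# Simulate data from a standard normal distribution
vals = rnorm( n = 10000, mean = 0, sd = 1 )
print( summary(vals) )

# Negative log-likelihood of a normal distribution; params = c(mean, sd)
ll_func = function(params) {
   log_probs = dnorm( x = vals, mean = params[1], sd = params[2], log = TRUE )
   tot = sum(log_probs)
   return(-1 * tot)
}

# Initial guesses for mean and sd
params = c( 0.5, 10 )

print( ll_func(params) )
# optim() minimizes by default, which is why ll_func returns the negated sum
res = optim( params, ll_func )
print( paste("mean:", res$par[1]) )
print( paste("sd:  ", res$par[2]) )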

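Running this program in R produces output like the following for the fitted parameters (exact values vary from run to run because the data are random):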

[1] "mean: 0.0223766157516646"
[1] "sd:   0.991566611447471"

That's fairly close to the true values of mean = 0 and sd = 1 that were used to generate the data.

Don't forget that with a mixture of two distributions, you have one extra parameter that specifies the relative weights between the distributions. Also, be careful about fitting lots of parameters at once. With lots of free parameters you need to worry about overfitting.
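As a rough sketch of how this could look for the gamma plus t case, the mixture density can be written as w * Gamma(x) + (1 - w) * t(x), where the weight w (between 0 and 1) is the extra parameter, so that the result still integrates to 1. The example below is an illustration under stated assumptions, not a definitive implementation: the data are simulated, the true parameter values are arbitrary, the helper names unpack, mix_density and neg_ll are made up for this example, and the t component uses a location/scale/df parameterization (m, s, df). The parameters are optimized on an unconstrained scale (logit for the weight, log for the positive parameters) so that optim() cannot step into invalid values.

```r
set.seed(1)

# Simulated data (an assumption for illustration only):
# about 60% from a gamma, 40% from a shifted t distribution
n <- 5000
from_gamma <- rbinom(n, 1, 0.6) == 1
x <- ifelse(from_gamma,
            rgamma(n, shape = 5, rate = 1),
            -2 + 1.0 * rt(n, df = 5))

# Map unconstrained optimizer parameters to valid model parameters
unpack <- function(p) {
  list(w     = plogis(p[1]),  # mixture weight in (0, 1)
       shape = exp(p[2]),     # gamma shape > 0
       rate  = exp(p[3]),     # gamma rate > 0
       m     = p[4],          # t location
       s     = exp(p[5]),     # t scale > 0
       df    = exp(p[6]))     # t degrees of freedom > 0
}

# Density of the two-component mixture
mix_density <- function(x, th) {
  th$w * dgamma(x, shape = th$shape, rate = th$rate) +
    (1 - th$w) * dt((x - th$m) / th$s, df = th$df) / th$s
}

# Negative log-likelihood, since optim() minimizes by default
neg_ll <- function(p) {
  -sum(log(mix_density(x, unpack(p))))
}

# Starting values on the unconstrained scale
start <- c(qlogis(0.5), log(2), log(1), 0, log(1), log(10))

fit <- optim(start, neg_ll, control = list(maxit = 5000))
est <- unpack(fit$par)
print(unlist(est))
print(fit$convergence)  # 0 indicates that optim() reports convergence
```

Mixture likelihoods often have several local optima, so it is worth rerunning the fit from a few different starting values and comparing fit$value across runs before trusting the estimates.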

James Thompson answered Sep 23 '22 04:09



Try the mixdist package. Here's an example of fitting a mixture of three distributions:

https://stats.stackexchange.com/questions/10062/which-r-package-to-use-to-calculate-component-parameters-for-a-mixture-model

bill_080 answered Sep 22 '22 04:09
