I've been struggling with fitting a distribution to sample data I have in R. I've looked at using the fitdist as well as fitdistr functions, but I seem to be running into problems with both.
A quick background; the output of my code should be the most fitting distribution (from a list of distributions) to the data provided, with parameters. This needs to happen without human interaction, so comparing graphs is not an option. I was thinking that I could fit each distribution to the data, draw the p-value from the chi-squared test and find the distribution with the highest p-value. I've gotten some success in a normal distribution to the sample data, but as soon as I try to fit something more complex (a gamma distribution, as seen in the code), I get all kinds of errors. What am I doing wrong?
library(fitdistrplus)
require(MASS)
set.seed(1)
testData <- rnorm(1000)
distlist <- c("norm","unif","exp")
(z <- fitdist(testData,"gamma",start=list(rate=0.1),fix.arg=list(shape=4)))
Examples of errors I get are:
[1] "Error in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, : \n initial value in 'vmmin' is not finite\n" attr(,"class")
and
Error in fitdist(testData, "gamma", start = list(rate = 0.1), fix.arg = list(shape = 4)) : the function mle failed to estimate the parameters, with the error code 100
I know I'm probably implementing the fitdist function incorrectly, but I can't seem to find simple examples I can adapt to achieve my code objectives. Can anyone help?
We can identify 4 steps in fitting distributions: 1) Model/function choice: hypothesize families of distributions; 2) Estimate parameters; 3) Evaluate quality of fit; 4) Goodness of fit statistical tests.
Using Probability Plots to Identify the Distribution of Your Data. Probability plots might be the best way to determine whether your data follow a particular distribution. If your data follow the straight line on the graph, the distribution fits your data. This process is simple to do visually.
To plot the probability mass function for a Poisson distribution in R, we can use the following functions: dpois(x, lambda) to create the probability mass function. plot(x, y, type = 'h') to plot the probability mass function, specifying the plot to be a histogram (type='h')
You are looking for the Kolmogorov-Smirnov test. Null hypothesis is that the data sample is from the hypothesised distribution.
fitData <- function(data, fit="gamma", sample=0.5){
distrib = list()
numfit <- length(fit)
results = matrix(0, ncol=5, nrow=numfit)
for(i in 1:numfit){
if((fit[i] == "gamma") |
(fit[i] == "poisson") |
(fit[i] == "weibull") |
(fit[i] == "exponential") |
(fit[i] == "logistic") |
(fit[i] == "normal") |
(fit[i] == "geometric")
)
distrib[[i]] = fit[i]
else stop("Provide a valid distribution to fit data" )
}
# take a sample of dataset
n = round(length(data)*sample)
data = sample(data, size=n, replace=F)
for(i in 1:numfit) {
if(distrib[[i]] == "gamma") {
gf_shape = "gamma"
fd_g <- fitdistr(data, "gamma")
est_shape = fd_g$estimate[[1]]
est_rate = fd_g$estimate[[2]]
ks = ks.test(data, "pgamma", shape=est_shape, rate=est_rate)
# add to results
results[i,] = c(gf_shape, est_shape, est_rate, ks$statistic, ks$p.value)
}
else if(distrib[[i]] == "poisson"){
gf_shape = "poisson"
fd_p <- fitdistr(data, "poisson")
est_lambda = fd_p$estimate[[1]]
ks = ks.test(data, "ppois", lambda=est_lambda)
# add to results
results[i,] = c(gf_shape, est_lambda, "NA", ks$statistic, ks$p.value)
}
else if(distrib[[i]] == "weibull"){
gf_shape = "weibull"
fd_w <- fitdistr(data,densfun=dweibull,start=list(scale=1,shape=2))
est_shape = fd_w$estimate[[1]]
est_scale = fd_w$estimate[[2]]
ks = ks.test(data, "pweibull", shape=est_shape, scale=est_scale)
# add to results
results[i,] = c(gf_shape, est_shape, est_scale, ks$statistic, ks$p.value)
}
else if(distrib[[i]] == "normal"){
gf_shape = "normal"
fd_n <- fitdistr(data, "normal")
est_mean = fd_n$estimate[[1]]
est_sd = fd_n$estimate[[2]]
ks = ks.test(data, "pnorm", mean=est_mean, sd=est_sd)
# add to results
results[i,] = c(gf_shape, est_mean, est_sd, ks$statistic, ks$p.value)
}
else if(distrib[[i]] == "exponential"){
gf_shape = "exponential"
fd_e <- fitdistr(data, "exponential")
est_rate = fd_e$estimate[[1]]
ks = ks.test(data, "pexp", rate=est_rate)
# add to results
results[i,] = c(gf_shape, est_rate, "NA", ks$statistic, ks$p.value)
}
else if(distrib[[i]] == "logistic"){
gf_shape = "logistic"
fd_l <- fitdistr(data, "logistic")
est_location = fd_l$estimate[[1]]
est_scale = fd_l$estimate[[2]]
ks = ks.test(data, "plogis", location=est_location, scale=est_scale)
# add to results
results[i,] = c(gf_shape, est_location, est_scale, ks$statistic, ks$p.value)
}
}
results = rbind(c("distribution", "param1", "param2", "ks stat", "ks pvalue"), results)
#print(results)
return(results)
}
Applied to your example:
library(MASS)
set.seed(1)
testData <- rnorm(1000)
res = fitData(testData, fit=c("logistic","normal","exponential","poisson"),
sample=1)
res
You do not reject the null hypothesis for the Normal.
Reference: https://web.archive.org/web/20150407031710/http://worldofpiggy.com:80/2014/02/25/automatic-distribution-fitting-r/
I consider the error is mainly because of your data. As seen in the error message, NaN
is created so that the function seems to fail to obtain the score (by differentiating the density function). [range of density function is non-negative, isn't?]
Method of moments, which is simpler, is used instead of maximum likelihood estimation and it produces parameter estimates in spite of a warning.
library(fitdistrplus)
require(MASS)
set.seed(1)
testData <- rnorm(1000)
fitdist(testData, "gamma", method = "mme", start = list(shape = 0.1, rate = 0.1))
Fitting of the distribution ' gamma ' by matching moments
Parameters:
estimate
shape 0.0001268054
rate -0.0108863200
Warning message:
In dgamma(c(-0.626453810742332, 0.183643324222082, -0.835628612410047, :
NaNs produced
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With