How do I fit distributions to sample data in R?

Tags:

curve-fitting

I've been struggling with fitting a distribution to sample data I have in R. I've looked at using the fitdist as well as fitdistr functions, but I seem to be running into problems with both.

A quick background; the output of my code should be the most fitting distribution (from a list of distributions) to the data provided, with parameters. This needs to happen without human interaction, so comparing graphs is not an option. I was thinking that I could fit each distribution to the data, draw the p-value from the chi-squared test and find the distribution with the highest p-value. I've gotten some success in a normal distribution to the sample data, but as soon as I try to fit something more complex (a gamma distribution, as seen in the code), I get all kinds of errors. What am I doing wrong?

library(fitdistrplus) 
require(MASS) 
set.seed(1) 
testData <- rnorm(1000) 
distlist <- c("norm","unif","exp")

(z <- fitdist(testData,"gamma",start=list(rate=0.1),fix.arg=list(shape=4)))

Examples of errors I get are:

[1] "Error in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, : \n initial value in 'vmmin' is not finite\n" attr(,"class")

and

Error in fitdist(testData, "gamma", start = list(rate = 0.1), fix.arg = list(shape = 4)) : the function mle failed to estimate the parameters, with the error code 100

I know I'm probably implementing the fitdist function incorrectly, but I can't seem to find simple examples I can adapt to achieve my code objectives. Can anyone help?

473

asked Apr 24 '15 19:04

Martin

2 Answers

You are looking for the Kolmogorov-Smirnov test. Null hypothesis is that the data sample is from the hypothesised distribution.

fitData <- function(data, fit="gamma", sample=0.5){
 distrib = list()
 numfit <- length(fit)
 results = matrix(0, ncol=5, nrow=numfit)

 for(i in 1:numfit){
if((fit[i] == "gamma") | 
     (fit[i] == "poisson") | 
     (fit[i] == "weibull") | 
     (fit[i] == "exponential") |
     (fit[i] == "logistic") |
     (fit[i] == "normal") | 
     (fit[i] == "geometric")
) 
  distrib[[i]] = fit[i]
else stop("Provide a valid distribution to fit data" )
 }

 # take a sample of dataset
 n = round(length(data)*sample)
 data = sample(data, size=n, replace=F)

 for(i in 1:numfit) {
  if(distrib[[i]] == "gamma") {
  gf_shape = "gamma"
  fd_g <- fitdistr(data, "gamma")
  est_shape = fd_g$estimate[[1]]
  est_rate = fd_g$estimate[[2]]

  ks = ks.test(data, "pgamma", shape=est_shape, rate=est_rate)

  # add to results
  results[i,] = c(gf_shape, est_shape, est_rate, ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "poisson"){
  gf_shape = "poisson"
  fd_p <- fitdistr(data, "poisson")
  est_lambda = fd_p$estimate[[1]]

  ks = ks.test(data, "ppois", lambda=est_lambda)
  # add to results
  results[i,] = c(gf_shape, est_lambda, "NA", ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "weibull"){
  gf_shape = "weibull"
  fd_w <- fitdistr(data,densfun=dweibull,start=list(scale=1,shape=2))
  est_shape = fd_w$estimate[[1]]
  est_scale = fd_w$estimate[[2]]

  ks = ks.test(data, "pweibull", shape=est_shape, scale=est_scale)
  # add to results
  results[i,] = c(gf_shape, est_shape, est_scale, ks$statistic, ks$p.value) 
}

else if(distrib[[i]] == "normal"){
  gf_shape = "normal"
  fd_n <- fitdistr(data, "normal")
  est_mean = fd_n$estimate[[1]]
  est_sd = fd_n$estimate[[2]]

  ks = ks.test(data, "pnorm", mean=est_mean, sd=est_sd)
  # add to results
  results[i,] = c(gf_shape, est_mean, est_sd, ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "exponential"){
  gf_shape = "exponential"
  fd_e <- fitdistr(data, "exponential")
  est_rate = fd_e$estimate[[1]]
  ks = ks.test(data, "pexp", rate=est_rate)
  # add to results
  results[i,] = c(gf_shape, est_rate, "NA", ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "logistic"){
  gf_shape = "logistic"
  fd_l <- fitdistr(data, "logistic")
  est_location = fd_l$estimate[[1]]
  est_scale = fd_l$estimate[[2]]
  ks = ks.test(data, "plogis", location=est_location, scale=est_scale)
  # add to results
  results[i,] = c(gf_shape, est_location, est_scale, ks$statistic,    ks$p.value) 
    }
  }
  results = rbind(c("distribution", "param1", "param2", "ks stat", "ks    pvalue"),   results)
  #print(results)
  return(results)
  }

Applied to your example:

library(MASS)
set.seed(1) 
testData <- rnorm(1000) 
res = fitData(testData, fit=c("logistic","normal","exponential","poisson"),
    sample=1)
res

You do not reject the null hypothesis for the Normal.

Reference: https://web.archive.org/web/20150407031710/http://worldofpiggy.com:80/2014/02/25/automatic-distribution-fitting-r/

answered Oct 20 '22 17:10

TinaW

I consider the error is mainly because of your data. As seen in the error message, NaN is created so that the function seems to fail to obtain the score (by differentiating the density function). [range of density function is non-negative, isn't?]

Method of moments, which is simpler, is used instead of maximum likelihood estimation and it produces parameter estimates in spite of a warning.

library(fitdistrplus) 
require(MASS) 
set.seed(1) 
testData <- rnorm(1000) 
fitdist(testData, "gamma", method = "mme", start = list(shape = 0.1, rate = 0.1))

Fitting of the distribution ' gamma ' by matching moments 
Parameters:
           estimate
shape  0.0001268054
rate  -0.0108863200
Warning message:
In dgamma(c(-0.626453810742332, 0.183643324222082, -0.835628612410047,  :
  NaNs produced

answered Oct 20 '22 19:10

Jaehyeon Kim

Related questions
                            
                                How to combine multiple JSON files into a single file in R
                            
                                Function that returns a function in R
                            
                                Extracting Dates from xts object based on vaule
                            
                                how to pause R and resume it later?
                            
                                Loading intraday data into R for handling it with quantmod
                            
                                How to make custom plot symbols from vector graphics in R
                            
                                Using function result again in j expression
                            
                                How to avoid `all` function returning `TRUE` when comparing to `NULL` or an empty object
                            
                                Splitting the values in column using regex
                            
                                Colors lost in legend when using scale_shape_manual
                            
                                Is there a way in data.table to assign ID's by group based upon an identifier? [duplicate]
                            
                                convert a csv to excel without using xlsx package
                            
                                Merging contents of columns using apply or other vectorized approach
                            
                                Remove the last part of a string after the last "." in R
                            
                                File compression options with ggplot2
                            
                                Calculate quantiles in R without interpolation - round up or down to actual value
                            
                                How to change line properties in ggplot2 halfway in a time series?
                            
                                R: Count objects in column-list
                            
                                Multiplying column value by another value depending on value in certain column R
                            
                                Group categories in R according to first letters of a string?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With