Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I fit distributions to sample data in R?

I've been struggling with fitting a distribution to sample data I have in R. I've looked at using the fitdist as well as fitdistr functions, but I seem to be running into problems with both.

A quick background; the output of my code should be the most fitting distribution (from a list of distributions) to the data provided, with parameters. This needs to happen without human interaction, so comparing graphs is not an option. I was thinking that I could fit each distribution to the data, draw the p-value from the chi-squared test and find the distribution with the highest p-value. I've gotten some success in a normal distribution to the sample data, but as soon as I try to fit something more complex (a gamma distribution, as seen in the code), I get all kinds of errors. What am I doing wrong?

library(fitdistrplus) 
require(MASS) 
set.seed(1) 
testData <- rnorm(1000) 
distlist <- c("norm","unif","exp")

(z <- fitdist(testData,"gamma",start=list(rate=0.1),fix.arg=list(shape=4)))

Examples of errors I get are:

[1] "Error in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, : \n initial value in 'vmmin' is not finite\n" attr(,"class")

and

Error in fitdist(testData, "gamma", start = list(rate = 0.1), fix.arg = list(shape = 4)) : the function mle failed to estimate the parameters, with the error code 100

I know I'm probably implementing the fitdist function incorrectly, but I can't seem to find simple examples I can adapt to achieve my code objectives. Can anyone help?

like image 473
Martin Avatar asked Apr 24 '15 19:04

Martin


People also ask

How do you fit a distribution to data in R?

We can identify 4 steps in fitting distributions: 1) Model/function choice: hypothesize families of distributions; 2) Estimate parameters; 3) Evaluate quality of fit; 4) Goodness of fit statistical tests.

How do you fit a distribution into data?

Using Probability Plots to Identify the Distribution of Your Data. Probability plots might be the best way to determine whether your data follow a particular distribution. If your data follow the straight line on the graph, the distribution fits your data. This process is simple to do visually.

How do you fit a Poisson distribution to data in R?

To plot the probability mass function for a Poisson distribution in R, we can use the following functions: dpois(x, lambda) to create the probability mass function. plot(x, y, type = 'h') to plot the probability mass function, specifying the plot to be a histogram (type='h')


2 Answers

You are looking for the Kolmogorov-Smirnov test. Null hypothesis is that the data sample is from the hypothesised distribution.

fitData <- function(data, fit="gamma", sample=0.5){
 distrib = list()
 numfit <- length(fit)
 results = matrix(0, ncol=5, nrow=numfit)

 for(i in 1:numfit){
if((fit[i] == "gamma") | 
     (fit[i] == "poisson") | 
     (fit[i] == "weibull") | 
     (fit[i] == "exponential") |
     (fit[i] == "logistic") |
     (fit[i] == "normal") | 
     (fit[i] == "geometric")
) 
  distrib[[i]] = fit[i]
else stop("Provide a valid distribution to fit data" )
 }

 # take a sample of dataset
 n = round(length(data)*sample)
 data = sample(data, size=n, replace=F)

 for(i in 1:numfit) {
  if(distrib[[i]] == "gamma") {
  gf_shape = "gamma"
  fd_g <- fitdistr(data, "gamma")
  est_shape = fd_g$estimate[[1]]
  est_rate = fd_g$estimate[[2]]

  ks = ks.test(data, "pgamma", shape=est_shape, rate=est_rate)

  # add to results
  results[i,] = c(gf_shape, est_shape, est_rate, ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "poisson"){
  gf_shape = "poisson"
  fd_p <- fitdistr(data, "poisson")
  est_lambda = fd_p$estimate[[1]]

  ks = ks.test(data, "ppois", lambda=est_lambda)
  # add to results
  results[i,] = c(gf_shape, est_lambda, "NA", ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "weibull"){
  gf_shape = "weibull"
  fd_w <- fitdistr(data,densfun=dweibull,start=list(scale=1,shape=2))
  est_shape = fd_w$estimate[[1]]
  est_scale = fd_w$estimate[[2]]

  ks = ks.test(data, "pweibull", shape=est_shape, scale=est_scale)
  # add to results
  results[i,] = c(gf_shape, est_shape, est_scale, ks$statistic, ks$p.value) 
}

else if(distrib[[i]] == "normal"){
  gf_shape = "normal"
  fd_n <- fitdistr(data, "normal")
  est_mean = fd_n$estimate[[1]]
  est_sd = fd_n$estimate[[2]]

  ks = ks.test(data, "pnorm", mean=est_mean, sd=est_sd)
  # add to results
  results[i,] = c(gf_shape, est_mean, est_sd, ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "exponential"){
  gf_shape = "exponential"
  fd_e <- fitdistr(data, "exponential")
  est_rate = fd_e$estimate[[1]]
  ks = ks.test(data, "pexp", rate=est_rate)
  # add to results
  results[i,] = c(gf_shape, est_rate, "NA", ks$statistic, ks$p.value)
}

else if(distrib[[i]] == "logistic"){
  gf_shape = "logistic"
  fd_l <- fitdistr(data, "logistic")
  est_location = fd_l$estimate[[1]]
  est_scale = fd_l$estimate[[2]]
  ks = ks.test(data, "plogis", location=est_location, scale=est_scale)
  # add to results
  results[i,] = c(gf_shape, est_location, est_scale, ks$statistic,    ks$p.value) 
    }
  }
  results = rbind(c("distribution", "param1", "param2", "ks stat", "ks    pvalue"),   results)
  #print(results)
  return(results)
  }

Applied to your example:

library(MASS)
set.seed(1) 
testData <- rnorm(1000) 
res = fitData(testData, fit=c("logistic","normal","exponential","poisson"),
    sample=1)
res

You do not reject the null hypothesis for the Normal.

Reference: https://web.archive.org/web/20150407031710/http://worldofpiggy.com:80/2014/02/25/automatic-distribution-fitting-r/

like image 88
TinaW Avatar answered Oct 20 '22 17:10

TinaW


I consider the error is mainly because of your data. As seen in the error message, NaN is created so that the function seems to fail to obtain the score (by differentiating the density function). [range of density function is non-negative, isn't?]

Method of moments, which is simpler, is used instead of maximum likelihood estimation and it produces parameter estimates in spite of a warning.

library(fitdistrplus) 
require(MASS) 
set.seed(1) 
testData <- rnorm(1000) 
fitdist(testData, "gamma", method = "mme", start = list(shape = 0.1, rate = 0.1))

Fitting of the distribution ' gamma ' by matching moments 
Parameters:
           estimate
shape  0.0001268054
rate  -0.0108863200
Warning message:
In dgamma(c(-0.626453810742332, 0.183643324222082, -0.835628612410047,  :
  NaNs produced
like image 21
Jaehyeon Kim Avatar answered Oct 20 '22 19:10

Jaehyeon Kim