
Kolmogorov-Smirnov or a Chi-Square test for a distribution?

Tags:

r

statistics

I used model fitting to fit a negative binomial distribution to my discrete data. As a final step, it looks like I need to perform a Kolmogorov-Smirnov test to determine whether the model fits the data well. All the references I could find discuss using the test for normally distributed continuous data. Can someone tell me whether this can be done in R for data that are discrete and not normally distributed? (Even a chi-square test should do, I'm guessing, but please correct me if I'm wrong.)

UPDATE:

So I found that the vcd package contains a function goodfit that can be used for this purpose in the following way:

library(vcd)

# Define the data
data <- c(67, 81, 93, 65, 18, 44, 31, 103, 64, 19, 27, 57, 63, 25, 22, 150,
          31, 58, 93, 6, 86, 43, 17, 9, 78, 23, 75, 28, 37, 23, 108, 14, 137,
          69, 58, 81, 62, 25, 54, 57, 65, 72, 17, 22, 170, 95, 38, 33, 34, 68,
          38, 117, 28, 17, 19, 25, 24, 15, 103, 31, 33, 77, 38, 8, 48, 32, 48,
          26, 63, 16, 70, 87, 31, 36, 31, 38, 91, 117, 16, 40, 7, 26, 15, 89,
          67, 7, 39, 33, 58)

gf <- goodfit(data, type = "nbinomial", method = "MinChisq") 
plot(gf)

But after the gf <- ... step, R complains saying:

Warning messages:
1: In pnbinom(q, size, prob, lower.tail, log.p) : NaNs produced
2: In pnbinom(q, size, prob, lower.tail, log.p) : NaNs produced
3: In pnbinom(q, size, prob, lower.tail, log.p) : NaNs produced

and when I call plot(gf), it complains:

Error in xy.coords(x, y, xlabel, ylabel, log) : 
  'x' is a list, but does not have components 'x' and 'y'

I am not sure what is happening because if I set data to be the following:

data <- rnbinom(200, size = 1.5, prob = 0.8)

everything works fine. Any suggestions?

Legend asked Dec 02 '10 06:12


1 Answer

A KS test is for continuous variables only, and you have to fully specify the distribution you are testing against. If you still wanted to run it, it would look like this:

ks.test(data, pnbinom, size=100, prob=0.8)

It compares the empirical cumulative distribution function of data against the specified one (whether that makes sense probably depends on your data). You would have to choose the parameters size and prob based on theoretical considerations; the test is not valid if you estimate those parameters from the same data.
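To see what that call does with count data, here is a minimal sketch using base R only. It assumes simulated counts with the same settings as the rnbinom() call in the question, so the parameters are known in advance rather than estimated; the seed and the variable names are just for illustration.

```r
# Simulated counts with known parameters (same settings as the rnbinom()
# call in the question), so nothing is estimated from the data.
set.seed(42)
x <- rnbinom(200, size = 1.5, prob = 0.8)

# Counts repeat, which violates the continuity assumption behind the
# usual KS p-value.
has_ties <- any(duplicated(x))
print(has_ties)

# ks.test() warns about ties here; the warning is suppressed only to
# keep the output short.
ks_res <- suppressWarnings(ks.test(x, "pnbinom", size = 1.5, prob = 0.8))
print(ks_res$statistic)
print(ks_res$p.value)
```

The call runs, but because of the ties the reported p-value should be read with caution.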

Your problem with goodfit() might have to do with your data. Are you sure these are counts? barplot(table(data)) does not look like it approximately follows a negative binomial distribution; compare it with, e.g., barplot(table(rnbinom(200, size = 1.5, prob = 0.8))).
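A quick way to look at the data without plotting is sketched below, using the vector from the question (renamed counts here). The bin width of 20 is an arbitrary choice for illustration, not something goodfit() uses.

```r
# Data copied from the question.
counts <- c(67, 81, 93, 65, 18, 44, 31, 103, 64, 19, 27, 57, 63, 25, 22, 150,
            31, 58, 93, 6, 86, 43, 17, 9, 78, 23, 75, 28, 37, 23, 108, 14, 137,
            69, 58, 81, 62, 25, 54, 57, 65, 72, 17, 22, 170, 95, 38, 33, 34, 68,
            38, 117, 28, 17, 19, 25, 24, 15, 103, 31, 33, 77, 38, 8, 48, 32, 48,
            26, 63, 16, 70, 87, 31, 36, 31, 38, 91, 117, 16, 40, 7, 26, 15, 89,
            67, 7, 39, 33, 58)

# Sample mean and variance: a negative binomial needs variance > mean.
m <- mean(counts)
v <- var(counts)
print(c(mean = m, variance = v))

# How thinly the observations are spread over distinct values; this is
# what table(data) has to work with.
n_distinct <- length(unique(counts))
max_freq <- max(table(counts))
print(c(n = length(counts), distinct = n_distinct, max_freq = max_freq))

# Grouping into bins of width 20 (an arbitrary choice) gives a table
# that is easier to read than one bar per distinct value.
binned <- table(cut(counts, breaks = seq(0, 180, by = 20)))
print(binned)
```

Passing binned to barplot() should give a more readable picture of the shape than barplot(table(data)).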

Finally, I'm not sure if the approach of doing a null-hypothesis test after fitting is appropriate. You may want to look into quantitative fit measures beyond / based on $\chi^2$ of which there are many (RMSEA, SRMR, ...).
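For reference, the $\chi^2$ statistic that such measures build on could be computed by hand roughly as follows. This is only a sketch under several assumptions: the parameters come from the method of moments (not from goodfit()'s "MinChisq" fit), the bin edges are arbitrary, and the degrees of freedom are nominal. It reports the statistic as a descriptive number rather than a p-value.

```r
# Data copied from the question.
counts <- c(67, 81, 93, 65, 18, 44, 31, 103, 64, 19, 27, 57, 63, 25, 22, 150,
            31, 58, 93, 6, 86, 43, 17, 9, 78, 23, 75, 28, 37, 23, 108, 14, 137,
            69, 58, 81, 62, 25, 54, 57, 65, 72, 17, 22, 170, 95, 38, 33, 34, 68,
            38, 117, 28, 17, 19, 25, 24, 15, 103, 31, 33, 77, 38, 8, 48, 32, 48,
            26, 63, 16, 70, 87, 31, 36, 31, 38, 91, 117, 16, 40, 7, 26, 15, 89,
            67, 7, 39, 33, 58)

# Method-of-moments estimates (an assumption of this sketch):
# var = mu + mu^2 / size  =>  size = mu^2 / (var - mu)
m <- mean(counts)
v <- var(counts)
stopifnot(v > m)
size_mom <- m^2 / (v - m)

# Arbitrary upper bin edges; the last bin is everything above 100.
upper <- c(20, 40, 60, 80, 100)
observed <- as.vector(table(cut(counts, breaks = c(-Inf, upper, Inf))))

# Bin probabilities under the fitted negative binomial.
cdf <- pnbinom(upper, size = size_mom, mu = m)
probs <- diff(c(0, cdf, 1))
expected <- length(counts) * probs

# Pearson chi-square statistic and a nominal df
# (bins - 1 - two estimated parameters).
x2 <- sum((observed - expected)^2 / expected)
df_nominal <- length(observed) - 1 - 2
print(c(x2 = x2, df = df_nominal, ratio = x2 / df_nominal))
```

Changing the bin edges changes the statistic, so the number is only meaningful relative to a stated binning.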

caracal answered Sep 20 '22 13:09