
How do I determine a best-fit distribution in Java?

I have a number of data sets (between 50 and 500 points each, where every point is a positive integer) and need to determine which distribution best describes each of them. I have done this manually for several of them, but need to automate it going forward.

Some of the sets are completely modal (every datum has the value 15), some are strongly modal or bimodal, some are bell curves (often skewed and with differing degrees of kurtosis/pointiness), some are roughly flat, and there are any number of other possible distributions (Poisson, power-law, etc.). I need a way to determine which distribution best describes the data and (ideally) also provide a goodness-of-fit metric so that I know how confident I can be in the analysis.

Existing open-source libraries would be ideal, followed by well documented algorithms that I can implement myself.

Eadwacer asked Jun 02 '10 21:06




1 Answer

Searching for a best-fitting distribution without some a priori knowledge is unlikely to give you good results. You may find a distribution that happens to fit well but is unlikely to be the underlying distribution.

Do you have any metadata available that would hint at what the data means? E.g., "this is open-ended data sampled from a natural population, so it's some sort of normal distribution", vs. "this data is inherently bounded at 0 and discrete, so check for the best-fitting Poisson".
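To make the "check for the best-fitting Poisson" idea concrete, here is a minimal sketch in plain Java, not a definitive implementation. The class name `PoissonFit`, its method names, and the `tailStart` binning parameter are all hypothetical. It estimates the rate as the sample mean (the maximum-likelihood estimate for a Poisson) and computes a Pearson chi-square statistic as a rough fitness metric. It assumes non-negative integer data and a moderate rate; turning the statistic into a p-value would still require a chi-square CDF with (number of bins - 2) degrees of freedom, which is not shown here.

```java
final class PoissonFit {

    /** Maximum-likelihood estimate of the Poisson rate, which is the sample mean. */
    static double estimateLambda(int[] data) {
        long sum = 0;
        for (int x : data) {
            sum += x;
        }
        return (double) sum / data.length;
    }

    /**
     * Pearson chi-square statistic of the data against Poisson(lambda).
     * Bins are 0, 1, ..., tailStart - 1, plus one pooled bin for values >= tailStart.
     * Assumes every value is >= 0. Smaller results mean a closer fit.
     */
    static double chiSquare(int[] data, double lambda, int tailStart) {
        int n = data.length;
        double[] observed = new double[tailStart + 1];
        for (int x : data) {
            observed[Math.min(x, tailStart)]++;
        }

        double stat = 0.0;
        double pmf = Math.exp(-lambda); // P(X = 0); underflows to 0 for very large lambda
        double cumulative = 0.0;
        for (int k = 0; k < tailStart; k++) {
            double expected = n * pmf;
            double diff = observed[k] - expected;
            stat += diff * diff / expected;
            cumulative += pmf;
            pmf *= lambda / (k + 1); // P(X = k + 1) from P(X = k)
        }

        // Everything at or above tailStart is pooled into the last bin.
        double expectedTail = n * (1.0 - cumulative);
        double tailDiff = observed[tailStart] - expectedTail;
        stat += tailDiff * tailDiff / expectedTail;
        return stat;
    }
}
```

Choose `tailStart` so that no bin has a tiny expected count (a common rule of thumb is at least 5 per bin), otherwise the statistic is dominated by the sparse bins. Even a small statistic only says the Poisson is not contradicted by the data, not that it is the underlying distribution.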

I don't know of any distribution solvers for Java off the top of my head, and I don't know of any that will guess which distribution to use. You could examine some statistical properties (skewness, kurtosis, etc.) and make some guesses from them, but you're more likely to end up with an accidentally good fit that does not adequately represent the underlying distribution. Real data is noisy, and there are just too many degrees of freedom if you don't even know which family of distributions you are dealing with.
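For the "examine some statistical properties" step, a minimal sketch of the usual moment-based summaries might look like the following. The class `SampleMoments` and its members are hypothetical names, and the formulas are the plain population (biased) versions rather than bias-corrected estimators. It also flags the degenerate case from the question where every datum has the same value, since skewness and kurtosis are undefined there.

```java
final class SampleMoments {
    final double mean;
    final double variance;       // population variance (divides by n)
    final double skewness;       // NaN when variance is 0
    final double excessKurtosis; // 0 for a normal distribution; NaN when variance is 0

    SampleMoments(int[] data) {
        int n = data.length;
        double sum = 0.0;
        for (int x : data) {
            sum += x;
        }
        double mu = sum / n;

        // Second, third and fourth central moments.
        double m2 = 0.0, m3 = 0.0, m4 = 0.0;
        for (int x : data) {
            double d = x - mu;
            m2 += d * d;
            m3 += d * d * d;
            m4 += d * d * d * d;
        }
        m2 /= n;
        m3 /= n;
        m4 /= n;

        mean = mu;
        variance = m2;
        skewness = m2 == 0.0 ? Double.NaN : m3 / Math.pow(m2, 1.5);
        excessKurtosis = m2 == 0.0 ? Double.NaN : m4 / (m2 * m2) - 3.0;
    }

    /** True when all values are identical, e.g. every datum is 15. */
    boolean isDegenerate() {
        return variance == 0.0;
    }

    /** Variance-to-mean ratio; close to 1 is consistent with a Poisson. */
    double dispersionIndex() {
        return variance / mean;
    }
}
```

These numbers only narrow the candidates. For example, a dispersion index near 1 with positive skew is consistent with a Poisson, but with 50 to 500 noisy points several families can produce similar moments.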

Alex Feinman answered Sep 28 '22 09:09
