Given a set of data values, I'm trying to get the best theoretical distribution that describes the data well. I came up with the following python code after days of research.
import numpy as np
import csv
import pandas as pd
import scipy.stats as st
import math
import sys
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
def fit_to_all_distributions(data):
dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']
params = {}
for dist_name in dist_names:
try:
dist = getattr(st, dist_name)
param = dist.fit(data)
params[dist_name] = param
except Exception:
print("Error occurred in fitting")
params[dist_name] = "Error"
return params
def get_best_distribution_using_chisquared_test(data, params):
histo, bin_edges = np.histogram(data, bins='auto', normed=False)
number_of_bins = len(bin_edges) - 1
observed_values = histo
dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']
dist_results = []
for dist_name in dist_names:
param = params[dist_name]
if (param != "Error"):
# Applying the SSE test
arg = param[:-2]
loc = param[-2]
scale = param[-1]
cdf = getattr(st, dist_name).cdf(bin_edges, loc=loc, scale=scale, *arg)
expected_values = len(data) * np.diff(cdf)
c , p = st.chisquare(observed_values, expected_values, ddof=number_of_bins-len(param))
dist_results.append([dist_name, c, p])
# select the best fitted distribution
best_dist, best_c, best_p = None, sys.maxsize, 0
for item in dist_results:
name = item[0]
c = item[1]
p = item[2]
if (not math.isnan(c)):
if (c < best_c):
best_c = c
best_dist = name
best_p = p
# print the name of the best fit and its p value
print("Best fitting distribution: " + str(best_dist))
print("Best c value: " + str(best_c))
print("Best p value: " + str(best_p))
print("Parameters for the best fit: " + str(params[best_dist]))
return best_dist, best_c, params[best_dist], dist_results
Then I test this code by,
a, m = 3., 2.
values = (np.random.pareto(a, 1000) + 1) * m
data = pd.Series(values)
params = fit_to_all_distributions(data)
best_dist_chi, best_chi, params_chi, dist_results_chi = get_best_distribution_using_chisquared_test(values, params)
Since the data points are generated using Pareto distribution, it should return pareto as the best fitting distribution with a sufficiently large p value (p>0.05).
But this is what I get as output.
Best fitting distribution: genextreme
Best c value: 106.46087793622216
Best p value: 7.626303538461713e-24
Parameters for the best fit: (-0.7664124294696955, 2.3217378846757164, 0.3711562696710188)
Is there anything wrong with my implementation of Chi Squared goodness of fit test?
If you want to know the "goodness of fit", use the R squared stat. R squared tells you how much of the observed variance in the outcome is explained by the input. Here is an example in python. This returns 0.801 , so 80.1% percent of the variance in y seems to be explained by x.
The Chi-Square Test for Normality allows us to check whether or not a model or theory follows an approximately normal distribution. The Chi-Square Test for Normality is not as powerful as other more specific tests (like Lilliefors).
Chi-Square Goodness of Fit Test for the Poisson Distribution. The chi-square goodness of fit test can evaluate a sample and see if it follows the Poisson distribution. The Poisson distribution is a discrete probability distribution that can model counts of events or attributes in a fixed observation space.
Python chi square goodness of fit test (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) mentions that "“Delta degrees of freedom”: adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. The default value of ddof is 0."
Hence your code should be corrected as follows.
c , p = st.chisquare(observed_values, expected_values, ddof=len(param))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With