Python chi square goodness of fit test to get the best distribution

Tags:

Given a set of data values, I'm trying to get the best theoretical distribution that describes the data well. I came up with the following python code after days of research.

import numpy as np
import csv
import pandas as pd
import scipy.stats as st
import math
import sys
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def fit_to_all_distributions(data):
    dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']

    params = {}
    for dist_name in dist_names:
        try:
            dist = getattr(st, dist_name)
            param = dist.fit(data)

            params[dist_name] = param
        except Exception:
            print("Error occurred in fitting")
            params[dist_name] = "Error"

    return params 


def get_best_distribution_using_chisquared_test(data, params):

    histo, bin_edges = np.histogram(data, bins='auto', normed=False)
    number_of_bins = len(bin_edges) - 1
    observed_values = histo

    dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']

    dist_results = []

    for dist_name in dist_names:

        param = params[dist_name]
        if (param != "Error"):
            # Applying the SSE test
            arg = param[:-2]
            loc = param[-2]
            scale = param[-1]
            cdf = getattr(st, dist_name).cdf(bin_edges, loc=loc, scale=scale, *arg)
            expected_values = len(data) * np.diff(cdf)
            c , p = st.chisquare(observed_values, expected_values, ddof=number_of_bins-len(param))
            dist_results.append([dist_name, c, p])


    # select the best fitted distribution
    best_dist, best_c, best_p = None, sys.maxsize, 0

    for item in dist_results:
        name = item[0]
        c = item[1]
        p = item[2]
        if (not math.isnan(c)):
            if (c < best_c):
                best_c = c
                best_dist = name
                best_p = p

    # print the name of the best fit and its p value

    print("Best fitting distribution: " + str(best_dist))
    print("Best c value: " + str(best_c))
    print("Best p value: " + str(best_p))
    print("Parameters for the best fit: " + str(params[best_dist]))

    return best_dist, best_c, params[best_dist], dist_results

Then I test this code by,

a, m = 3., 2.
values = (np.random.pareto(a, 1000) + 1) * m
data = pd.Series(values)
params = fit_to_all_distributions(data)
best_dist_chi, best_chi, params_chi, dist_results_chi = get_best_distribution_using_chisquared_test(values, params)

Since the data points are generated using Pareto distribution, it should return pareto as the best fitting distribution with a sufficiently large p value (p>0.05).

But this is what I get as output.

Best fitting distribution: genextreme
Best c value: 106.46087793622216
Best p value: 7.626303538461713e-24
Parameters for the best fit: (-0.7664124294696955, 2.3217378846757164, 0.3711562696710188)

Is there anything wrong with my implementation of Chi Squared goodness of fit test?

368

asked Aug 17 '18 11:08

Pasindu Tennage

1 Answers

Python chi square goodness of fit test (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) mentions that "“Delta degrees of freedom”: adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. The default value of ddof is 0."

Hence your code should be corrected as follows.

c , p = st.chisquare(observed_values, expected_values, ddof=len(param))

answered Oct 13 '22 17:10

Pasindu Tennage

Related questions
                            
                                Concatening an attention layer with decoder input seq2seq model on Keras
                            
                                When ruamel.yaml loads @dataclass from string, __post_init__ is not called
                            
                                SQLAlchemy get Mapper object from Table object (from Metadata or Session or otherwise)
                            
                                Multiple mkl packages installed in anaconda
                            
                                Given an integer, how big is its varint encoding?
                            
                                NodeJS implementation for Python's pbkdf2_sha256.verify
                            
                                Bag of Words (BOW) vs N-gram (sklearn CountVectorizer) - text documents classification
                            
                                How to use tf.data's initializable iterator and reinitializable interator and feed data to estimator api?
                            
                                Named accumulator in pyspark
                            
                                Python - open a website and send cookies
                            
                                How to sort a NumPy array by frequency?
                            
                                Pytest: deleting files created by the tested function
                            
                                How to compare equality of dataclasses holding numpy.ndarray (bool(a==b) raises ValueError)?
                            
                                Using proper file structure with SQLAlchemy and how to add data to db
                            
                                Deleting main object by deleting reference
                            
                                Can I patch Python's assert to get the output that py.test provides?
                            
                                Why does "xticks" work on the figure but "set_xticks" not on the axis?
                            
                                How to create a table with clickable hyperlink to a local file in pandas & Jupyter Notebook
                            
                                Numba - does nopython mode support list of tuples?
                            
                                How to convert a nested list into a one-dimensional list in Python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Python chi square goodness of fit test to get the best distribution

Tags:

python

statistics

scipy

Pasindu Tennage

People also ask

1 Answers

Pasindu Tennage

Recent Activity

Donate For Us