How to find probability distribution and parameters for real data? (Python 3)

Tags: python, machine-learning, statistics, distribution, data-fitting

I have a dataset from sklearn and I plotted the distribution of the load_diabetes.target data (i.e. the values of the regression that the load_diabetes.data are used to predict).

I used this dataset because it has the fewest variables/attributes of the regression datasets in sklearn.datasets.

Using Python 3, how can I get the distribution type and the parameters of the distribution that this most closely resembles?

All I know is that the target values are all positive and skewed (positive skew/right skew). Is there a way in Python to provide a few candidate distributions and then get the best fit for the target data/vector? Or, better, to have a fit suggested based on the data that is given? That would be really useful for people who have theoretical statistical knowledge but little experience applying it to "real data".

Bonus: Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data"? If not, why not?

from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

# Get data
data = load_diabetes()
X, y_ = data.data, data.target

# Organize data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

# Plot data (sns.distplot is deprecated; histplot with kde=True is the closest replacement)
fig, ax = plt.subplots()
sns.histplot(SR_y, bins=25, color="g", kde=True, stat="density", ax=ax)
plt.show()



asked May 27 '16 15:05

O.rka

2 Answers

Use this approach:

import scipy.stats as st

def get_best_distribution(data):
    dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"]
    dist_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        # Maximum-likelihood fit; returns shape parameter(s), loc and scale
        param = dist.fit(data)
        params[dist_name] = param

        # Apply the Kolmogorov-Smirnov test against the fitted distribution
        D, p = st.kstest(data, dist_name, args=param)
        print("p value for " + dist_name + " = " + str(p))
        dist_results.append((dist_name, p))

    # Select the best fitted distribution: the one with the largest p value
    best_dist, best_p = max(dist_results, key=lambda item: item[1])

    print("Best fitting distribution: " + str(best_dist))
    print("Best p value: " + str(best_p))
    print("Parameters for the best fit: " + str(params[best_dist]))

    return best_dist, best_p, params[best_dist]
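One caveat worth keeping in mind: when the parameters are estimated from the same sample that is then passed to kstest, the reported p values tend to be too optimistic, so they are better read as a relative ranking than as a real significance level. As an alternative way to rank candidates, here is a minimal sketch that compares fits by AIC (lower is better). This is a swapped-in criterion, not part of the approach above; the helper name rank_by_aic, the candidate list and the synthetic gamma sample are all illustrative assumptions.

```python
import numpy as np
import scipy.stats as st


def rank_by_aic(data, dist_names=("norm", "gamma", "lognorm", "weibull_min")):
    """Fit each candidate by maximum likelihood and rank by AIC (lower is better).

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    data = np.asarray(data, dtype=float)
    results = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(data)  # shape parameter(s), loc, scale
        log_lik = np.sum(dist.logpdf(data, *params))
        if not np.isfinite(log_lik):
            continue  # skip degenerate fits
        aic = 2 * len(params) - 2 * log_lik
        results.append((name, aic, params))
    return sorted(results, key=lambda r: r[1])


# Synthetic right-skewed, positive sample standing in for the real target vector
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

ranking = rank_by_aic(sample)
for name, aic, params in ranking:
    print(f"{name:12s} AIC = {aic:.1f}")
```

With real data you would pass data.target (or any 1-D array) instead of the synthetic sample. AIC penalises distributions with more parameters, which helps a little against always picking the most flexible family.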


answered Sep 20 '22 13:09

Pasindu Tennage

To the best of my knowledge, there is no automatic way of obtaining the distribution type and parameters of a sample (as inferring the distribution of a sample is a statistical problem by itself).

In my opinion, the best you can do is:

For each attribute:

1. Try to fit the attribute to a reasonably large list of candidate distributions (e.g. see "Fitting empirical distribution to theoretical ones with Scipy (Python)?" for an example with Scipy).
2. Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each of the fitted distributions (Scipy has an implementation of this too), and picking the one that minimises D, the test statistic, i.e. the largest absolute difference between the empirical CDF of the sample and the CDF of the fit.
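The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name rank_by_ks, the short candidate list and the synthetic lognormal sample are made up for the example, and a real run would use a longer list of distributions.

```python
import numpy as np
import scipy.stats as st


def rank_by_ks(data, dist_names=("norm", "gamma", "lognorm", "expon")):
    """Fit each candidate distribution and rank by the KS statistic D (smaller is closer).

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    data = np.asarray(data, dtype=float)
    results = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(data)  # step 1: fit the candidate
        res = st.kstest(data, name, args=params)  # step 2: compare sample and fit
        results.append((name, res.statistic, res.pvalue, params))
    return sorted(results, key=lambda r: r[1])


# Synthetic positive, right-skewed sample used in place of a real attribute
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=5.0, sigma=0.5, size=400)

ks_ranking = rank_by_ks(sample)
for name, d_stat, p_value, params in ks_ranking:
    print(f"{name:10s} D = {d_stat:.4f}  p = {p_value:.4f}")
```

The first entry of ks_ranking is the candidate with the smallest D. For a fixed sample size this gives the same ordering as picking the largest p value.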

Bonus: It would make sense, since you are effectively building a model of each variable when you pick a fit for it, although the quality of your predictions would depend on the quality of your data and on the distributions you use for fitting. You are building a model, after all.

answered Sep 21 '22 13:09

carrdelling

