
Fitting data points to a cumulative distribution

I am trying to fit a gamma distribution to my data points, and I can do that using the code below.

import scipy.stats as ss
import numpy as np

dataPoints = np.arange(0, 1000, 0.2)  # placeholder for my actual data
fit_alpha, fit_loc, fit_beta = ss.gamma.fit(dataPoints, floc=0)

I want to reconstruct a larger distribution using many such small gamma distributions (the larger distribution is irrelevant to the question; it only justifies why I am trying to fit a cdf as opposed to a pdf).

To achieve that, I want to fit a cumulative distribution function, as opposed to a pdf, to my smaller distribution data. More precisely, I want to fit the data to only a part of the cumulative distribution.

For example, I want to fit the data only until the cumulative probability function (with a certain scale and shape) reaches 0.6.

Any thoughts on using fit() for this purpose?
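To illustrate what I mean, here is a rough sketch of the kind of partial-cdf fit I have in mind (the data here is synthetic gamma samples standing in for my real points, and the use of `scipy.optimize.curve_fit` against a truncated empirical CDF is just my guess at an approach):

```python
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit

# synthetic stand-in for my real data points
rng = np.random.default_rng(0)
dataPoints = np.sort(rng.gamma(2.0, 2.0, size=1000))

# empirical CDF of the sorted data
ecdf = np.arange(1, len(dataPoints) + 1) / len(dataPoints)

# keep only the region where the empirical CDF is below 0.6
mask = ecdf <= 0.6
x, y = dataPoints[mask], ecdf[mask]

# fit shape (a) and scale of gamma.cdf to the truncated empirical CDF
popt, _ = curve_fit(
    lambda x, a, scale: ss.gamma.cdf(x, a, loc=0, scale=scale),
    x, y, p0=[1.0, 1.0], bounds=(1e-6, np.inf))
fit_a, fit_scale = popt
```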

asked Sep 17 '13 by Sahil M




1 Answer

I understand that you are trying to piecewise reconstruct your cdf with several small gamma distributions, each with a different scale and shape parameter capturing the 'local' regions of your distribution.

This probably makes sense if your empirical distribution is multi-modal or difficult to summarize with one 'global' parametric distribution.

I don't know if you have specific reasons for fitting several gamma distributions in particular, but if your goal is to fit a distribution which is relatively smooth and captures your empirical cdf well, you might take a look at Kernel Density Estimation (KDE). It is essentially a non-parametric way to fit a distribution to your data.

http://scikit-learn.org/stable/modules/density.html
http://en.wikipedia.org/wiki/Kernel_density_estimation

For example, you can try out a Gaussian kernel and change the bandwidth parameter to control how smooth your fit is. A bandwidth which is too small leads to an unsmooth ("overfitted") result [high variance, low bias]. A bandwidth which is too large results in a very smooth result but with high bias.

from sklearn.neighbors import KernelDensity  # sklearn.neighbors.kde is deprecated

# KernelDensity expects a 2D array of shape (n_samples, n_features)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints.reshape(-1, 1))
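Once fitted, you can evaluate the estimated density on a grid. Note that `score_samples` returns the log-density, so you need to exponentiate it (the data and grid range below are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# illustrative data: gamma-distributed samples, reshaped to (n_samples, 1)
rng = np.random.default_rng(0)
samples = rng.gamma(2.0, 2.0, size=500).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(samples)

# score_samples returns log-density; exponentiate to get the pdf estimate
grid_x = np.linspace(0, 25, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid_x))
```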

A good way to select a bandwidth parameter that balances the bias-variance tradeoff is cross-validation. The high-level idea is that you partition your data, run the analysis on the training set, and 'validate' on the test set; this prevents overfitting the data.
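As a sketch of that idea, sklearn's `GridSearchCV` can cross-validate over a grid of candidate bandwidths, scoring each by held-out log-likelihood (the data and bandwidth grid here are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# illustrative data: gamma-distributed samples, reshaped to (n_samples, 1)
rng = np.random.default_rng(0)
data = rng.gamma(2.0, 2.0, size=500).reshape(-1, 1)

# 5-fold CV over a grid of bandwidths; each candidate is scored by
# the total log-likelihood of the held-out fold
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 2.0, 20)},
                    cv=5)
grid.fit(data)
best_bw = grid.best_params_['bandwidth']
```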

Fortunately, the sklearn documentation also includes a nice example of choosing the best bandwidth of a Gaussian kernel using cross-validation, which you can borrow some code from:

http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html

Hope this helps!

answered Oct 06 '22 by Azmy Rajab