Creating new distributions in scipy

Tags:

I'm trying to create a distribution based on some data I have, then draw randomly from that distribution. Here's what I have:

from scipy import stats
import numpy

def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv()

if __name__ == "__main__":
    # pretend this is real data
    data = numpy.concatenate((numpy.random.normal(2,5,100), numpy.random.normal(25,5,100)))
    d = getDistribution(data)

    print d.rvs(size=100) # this usually fails

I think this is doing what I want it to, but I frequently get an error (see below) when I try to do d.rvs(), and d.rvs(100) never works. Am I doing something wrong? Is there an easier or better way to do this? If it's a bug in scipy, is there some way to get around it?

Finally, is there more documentation on creating custom distributions somewhere? The best I've found is the scipy.stats.rv_continuous documentation, which is pretty spartan and contains no useful examples.

The traceback:

Traceback (most recent call last): File "testDistributions.py", line 19, in print d.rvs(size=100) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 696, in rvs vals = self._rvs(*args) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1193, in _rvs Y = self._ppf(U,*args) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1212, in _ppf return self.vecfunc(q,*args) File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py", line 1862, in call theout = self.thefunc(*newargs) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1158, in _ppf_single_call return optimize.brentq(self._ppf_to_solve, self.xa, self.xb, args=(q,)+args, xtol=self.xtol) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/optimize/zeros.py", line 366, in brentq r = _zeros._brentq(f,a,b,xtol,maxiter,args,full_output,disp) ValueError: f(a) and f(b) must have different signs

Edit

For those curious, following the advice in the answer below, here's code that works:

from scipy import stats
import numpy

def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # don't ask me why it's using self._size 
            # nor why I have to cast to int
            return kernel.resample(int(self._size)) 
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
        def _pdf(self, x):
            return kernel.evaluate(x)
    return rv(name='kdedist', xa=-200, xb=200)

424

asked May 21 '12 01:05

Noah

1 Answers

Specifically to your traceback:

rvs uses the inverse of the cdf, ppf, to create random numbers. Since you are not specifying ppf, it is calculated by a rootfinding algorithm, brentq. brentq uses lower and upper bounds on where it should search for the value at with the function is zero (find x such that cdf(x)=q, q is quantile).

The default for the limits, xa and xb, are too small in your example. The following works for me with scipy 0.9.0, xa, xb can be set when creating the function instance

def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv(name='kdedist', xa=-200, xb=200)

There is currently a pull request for scipy to improve this, so in the next release xa and xb will be expanded automatically to avoid the f(a) and f(b) must have different signs exception.

There is not much documentation on this, the easiest is to follow some examples (and ask on the mailing list).

edit: addition

pdf: Since you have the density function also given by gaussian_kde, I would add the _pdf method, which will make some calculations more efficient.

edit2: addition

rvs: If you are interested in generating random numbers, then gaussian_kde has a resample method. Random Samples can be generated by sampling from the data and adding gaussian noise. So, this will be faster than the generic rvs using the ppf method. I would write a ._rvs method that just calls gaussian_kde's resample method.

precomputing ppf: I don't know of any general way to precompute the ppf. However, the way I thought of doing it (but never tried so far) is to precompute the ppf at many points and then use linear interpolation to approximate the ppf function.

edit3: about _rvs to answer Srivatsan's question in the comment

_rvs is the distribution specific method that is called by the public method rvs. rvs is a generic method that does some argument checking, adds location and scale, and sets the attribute self._size which is the size of the requested array of random variables, and then calls the distribution specific method ._rvs or it's generic counterpart. The extra arguments in ._rvs are shape parameters, but since there are none in this case, *x and **y are redundant and unused.

I don't know how well the size or shape of the .rvs method works in the multivariate case. These distributions are designed for univariate distributions, and might not fully work for the multivariate case, or might need some reshapes.

139

answered Oct 21 '22 08:10

Josef

Related questions
                            
                                In Python, what is a preferred naming convention for the backing store of an internal property?
                            
                                Adding a font for use in ReportLab
                            
                                Which logging library to use for cross-language (Java, C++, Python) system
                            
                                Fastest implementation to do multiple string substitutions in Python
                            
                                Create an html formatted report
                            
                                Solving functional equations programmatically
                            
                                Embed python/dsl for scripting in an PHP web application
                            
                                Writing a file with vim doesn't fire a file change event on OS X
                            
                                How to restrict certain paths in NetworkX graphs?
                            
                                Proper way to modify a response in Flask with eg process_response
                            
                                Keeping the WebSocket connection alive
                            
                                Alternative to scipy and numpy for linear algebra?
                            
                                Programmatically generate methods for a class
                            
                                Specific shuffling list in Python
                            
                                How can python subprocess.Popen see select.poll and then later not? (select 'module' object has no attribute 'poll')
                            
                                pexpect can't pass input over 1024 chars?
                            
                                Speeding up linear interpolation of many pixel locations in NumPy
                            
                                Mocking a class's construction
                            
                                Embedding matplotlib in C++
                            
                                Python PIL jpeg quality

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Creating new distributions in scipy

Tags:

python

numpy

scipy

Noah

People also ask

1 Answers

Josef

Recent Activity

Donate For Us