Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can some explain this strange behavior of the hypergeometric distribution in scipy?

Tags:

python

scipy

I am running Python 2.6.5 on Mac OS X 10.6.4 (this is not the native version, I installed it myself) with Scipy 0.8.0. If I do the following:

>>> from scipy.stats import hypergeom
>>> hypergeom.sf(5,10,2,5)

I get an IndexError. Then I do:

>>> hypergeom.sf(2,10,2,2)
-4.44....

I suspect the negative value is due to bad floating point precision. Then I do the first one again:

>>> hypergeom.sf(5,10,2,5)
0.0

Now it works! Can someone explain this? Are you seeing this behavior too?

like image 757
Björn Pollex Avatar asked Sep 28 '10 12:09

Björn Pollex


1 Answers

The problem seems to arise based if the first call to the survival function is in the range that should obviously be zero (see my comment to the previous answer). E.g., for calls to hypergeom.sf(x,M,n,N) it fails if the first call to a hypergeometric function to the function is a situation where x > n, where the survival function will always be zero.

You could trivially fix this temporarily by:

def new_hypergeom_sf(k, *args, **kwds):
    from scipy.stats import hypergeom
    (M, n, N) = args[0:3]
    try:
        return hypergeom.sf(k, *args, **kwds)
    except Exception as inst:
        if k >= n and type(inst) == IndexError:
            return 0 ## or conversely 1 - hypergeom.cdf(k, *args, **kwds)
        else:
            raise inst

Now if you have no problem editing the /usr/share/pyshared/scipy/stats/distributions.py (or equivalent file), the fix is likely on line 3966 where right now it reads:

    place(output,cond,self._sf(*goodargs))
    if output.ndim == 0:
        return output[()]
    return output

But if you change it to:

    if output.ndim == 0:
        return output[()]
    place(output,cond,self._sf(*goodargs))
    if output.ndim == 0:
        return output[()]
    return output

It now works without the IndexError. Basically if the output is zero dimensional because it fails the checks, it tries to call place, fails, and doesn't generate the distribution. (This doesn't happen if a previous distribution has already been created which is likely why this wasn't caught on earlier tests.) Note that place (defined in numpy's function_base.py) will change elements of the array (though I'm not sure if it changes the dimensionality) so it may be best to still have it leave the 0 dim check after place too. I haven't fully tested this to see if this change breaks anything else (and it applies to all discrete random variable distributions), so it maybe its best to do the first fix.

It does break it; e.g., stats.hypergeom.sf(1,10,2,5) returns as zero (instead of 2/9).

This fix seems to work much better, in the same section:

class rv_discrete(rv_generic):
...
    def sf(self, k, *args, **kwds):
    ...
        if any(cond):
            place(output,cond,self._sf(*goodargs))
        if output.ndim == 0:
            return output[()]
        return output
like image 166
dr jimbob Avatar answered Sep 28 '22 02:09

dr jimbob