Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Gaussian Kernel Density Estimation (KDE) of large numbers in Python

I have 1000 large numbers, randomly distributed in range 37231 to 56661.

I am trying to use the stats.gaussian_kde but something does not work. (maybe because of my poor knowledge of statistics?).

Here is the code:

from scipy import stats.gaussian_kde
import matplotlib.pyplot as plt

# 'data' is a 1D array that contains the initial numbers 37231 to 56661
xmin = min(data)
xmax = max(data)   

# get evenly distributed numbers for X axis.
x = linspace(xmin, xmax, 1000)   # get 1000 points on x axis
nPoints = len(x)

# get actual kernel density.
density = gaussian_kde(data)
y = density(x)

# print the output data
for i in range(nPoints):
    print "%s   %s" % (x[i], y[i])

plt.plot(x, density(x))
plt.show()

In the printout, I get x values in the column 1, and zeros in the column 2. The plot shows a flat line.

I simply can not find the solution. I tried a very wide range of X-es, the same result.

What is the problem? What am I doing wrong? Could the large numbers be the cause?

like image 333
Proteos Avatar asked Dec 22 '22 00:12

Proteos


2 Answers

I think what's happening is that your data array is made up of integers, which leads to problems:

>>> import numpy, scipy.stats
>>> 
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>> 
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
        44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

but if we use floats:

>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605.,  45451.,  46029.,  40379.,  48885.,  41262.,  39248.,
         38247.,  55987.,  44019.]])
>>> density(x)
array([  4.42201513e-05,   5.51130237e-05,   5.94470211e-05,
         5.78485526e-05,   5.21379448e-05,   4.43176188e-05,
         3.66725694e-05,   3.06297511e-05,   2.56191024e-05,
         2.01305127e-05])
like image 55
DSM Avatar answered Dec 26 '22 10:12

DSM


I have made a function to do this. You can vary the bandwidth as a parameter of the function. That is, smaller number = more pointy, larger number = smoother. The default is 0.3.

It works in IPython notebook --pylab=inline

The number of bins is optimized and coded so will vary on the number of variables in your data.

import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np

def hist_with_kde(data, bandwidth = 0.3):
    #set number of bins using Freedman and Diaconis
    q1 = np.percentile(data,25)
    q3 = np.percentile(data,75)


    n = len(data)**(.1/.3)
    rng = max(data) - min(data)
    iqr = 2*(q3-q1)
    bins = int((n*rng)/iqr)

    x = np.linspace(min(data),max(data),200)

    kde = stats.gaussian_kde(data)
    kde.covariance_factor = lambda : bandwidth
    kde._compute_covariance()

    plt.plot(x,kde(x),'r') # distribution function
    plt.hist(data,bins=bins,normed=True) # histogram

data = np.random.randn(500)
hist_with_kde(data,0.25)
like image 42
John Avatar answered Dec 26 '22 10:12

John