I have 1000 large numbers, randomly distributed in range 37231 to 56661.
I am trying to use the stats.gaussian_kde
but something does not work.
(maybe because of my poor knowledge of statistics?).
Here is the code:
from scipy import stats.gaussian_kde
import matplotlib.pyplot as plt
# 'data' is a 1D array that contains the initial numbers 37231 to 56661
xmin = min(data)
xmax = max(data)
# get evenly distributed numbers for X axis.
x = linspace(xmin, xmax, 1000) # get 1000 points on x axis
nPoints = len(x)
# get actual kernel density.
density = gaussian_kde(data)
y = density(x)
# print the output data
for i in range(nPoints):
print "%s %s" % (x[i], y[i])
plt.plot(x, density(x))
plt.show()
In the printout, I get x values in the column 1, and zeros in the column 2. The plot shows a flat line.
I simply can not find the solution. I tried a very wide range of X-es, the same result.
What is the problem? What am I doing wrong? Could the large numbers be the cause?
I think what's happening is that your data array is made up of integers, which leads to problems:
>>> import numpy, scipy.stats
>>>
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>>
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
but if we use floats:
>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605., 45451., 46029., 40379., 48885., 41262., 39248.,
38247., 55987., 44019.]])
>>> density(x)
array([ 4.42201513e-05, 5.51130237e-05, 5.94470211e-05,
5.78485526e-05, 5.21379448e-05, 4.43176188e-05,
3.66725694e-05, 3.06297511e-05, 2.56191024e-05,
2.01305127e-05])
I have made a function to do this. You can vary the bandwidth as a parameter of the function. That is, smaller number = more pointy, larger number = smoother. The default is 0.3.
It works in IPython notebook --pylab=inline
The number of bins is optimized and coded so will vary on the number of variables in your data.
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
def hist_with_kde(data, bandwidth = 0.3):
#set number of bins using Freedman and Diaconis
q1 = np.percentile(data,25)
q3 = np.percentile(data,75)
n = len(data)**(.1/.3)
rng = max(data) - min(data)
iqr = 2*(q3-q1)
bins = int((n*rng)/iqr)
x = np.linspace(min(data),max(data),200)
kde = stats.gaussian_kde(data)
kde.covariance_factor = lambda : bandwidth
kde._compute_covariance()
plt.plot(x,kde(x),'r') # distribution function
plt.hist(data,bins=bins,normed=True) # histogram
data = np.random.randn(500)
hist_with_kde(data,0.25)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With