In Python 3.7, I have a numpy array with shape=(2, 34900). This array is a list of coordinates, where index 0 represents the X axis and index 1 represents the Y axis.
When I use seaborn.kdeplot() to visualize the distribution of this data, I get the result in about 5-15 seconds running on a 7th-generation i5.
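For context, a minimal sketch of the seaborn call in question (synthetic data standing in for the real coordinates; on newer seaborn versions the keyword form x=..., y=... is required):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.randn(2, 34900)   # stand-in for the real (2, 34900) coordinate array
sns.kdeplot(data[0], data[1])      # bivariate KDE; newer seaborn: sns.kdeplot(x=data[0], y=data[1])
plt.show()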
But when I try to run the following piece of code:
import numpy as np
import scipy.stats.kde

# Find the kernel density estimate for the data
k = scipy.stats.kde.gaussian_kde(data, bw_method=.3)
# Define the grid: 2000 x 2000 points on [0, 1] x [0, 1]
xi, yi = np.mgrid[0:1:2000*1j, 0:1:2000*1j]
# Apply the estimated density function to the grid
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
which fits a Gaussian kernel density estimate to this data and evaluates it on the grid I defined, it takes much longer. I wasn't able to run it on the full array, but running on a slice of size 140 takes about 40 seconds to complete.
The 140-point slice does produce an interesting result, which I was able to visualize using plt.pcolormesh().
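For completeness, a rough sketch of how that plot can be made (assuming xi, yi and zi from the snippet above):
import matplotlib.pyplot as plt

# reshape the flat density values back onto the 2000 x 2000 grid and plot
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.show()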
My question is: what am I missing here? If I understand correctly, I'm using scipy.stats.kde.gaussian_kde() to create an estimate of the density function defined by the data. Then I'm evaluating that function on a 2D grid and getting its Z component as the result, which I then plot. How is this process so different from what seaborn.kdeplot() does that it makes the code take so much longer?
Scipy's implementation just goes through each point doing this:
for i in range(self.n):
    diff = self.dataset[:, i, newaxis] - points
    tdiff = dot(self.inv_cov, diff)
    energy = sum(diff * tdiff, axis=0) / 2.0
    result = result + exp(-energy)
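As a self-contained illustration of why this is costly, here is a rough sketch of the same evaluation strategy outside of scipy (the function name and argument layout are made up for illustration):
import numpy as np

def naive_kde_eval(dataset, inv_cov, points):
    # dataset: (d, n) data points; points: (d, m) evaluation points;
    # inv_cov: (d, d) inverse of the bandwidth-scaled covariance matrix.
    # Returns the (unnormalized) density estimate at each of the m points.
    result = np.zeros(points.shape[1])
    for i in range(dataset.shape[1]):
        diff = dataset[:, i, np.newaxis] - points       # (d, m)
        tdiff = np.dot(inv_cov, diff)                   # (d, m)
        energy = np.sum(diff * tdiff, axis=0) / 2.0     # (m,)
        result += np.exp(-energy)
    return result
The work grows with the number of data points times the number of evaluation points, so 34,900 data points evaluated on a 2000 x 2000 grid (4,000,000 points) is far more expensive than seaborn's default 100 x 100 grid.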
Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both univariate and multivariate data and includes automatic bandwidth determination.
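A quick univariate sketch of this, using synthetic data, just to illustrate the automatic bandwidth selection:
import numpy as np
from scipy import stats

sample = np.random.randn(1000)      # synthetic univariate data
kde = stats.gaussian_kde(sample)    # bandwidth chosen automatically (Scott's rule by default)
xs = np.linspace(-4, 4, 200)
density = kde(xs)                   # estimated PDF evaluated on a 1D grid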
Seaborn's kdeplot is a kernel density estimate plot: it depicts the probability density of continuous data at different values, and it can be drawn for a single variable or for several variables or samples in one figure, which makes for compact data visualization.
Seaborn has in general two ways to calculate the bivariate KDE. If available, it uses statsmodels; if not, it falls back to scipy.
The scipy code is similar to what is shown in the question: it uses scipy.stats.gaussian_kde. The statsmodels code uses statsmodels.nonparametric.api.KDEMultivariate.
However, for a fair comparison we would need to use the same grid size for both methods. The default gridsize for seaborn is 100 points.
import numpy as np; np.random.seed(42)
import seaborn.distributions as sd

N = 34900
x = np.random.randn(N)
y = np.random.randn(N)

# parameters matching seaborn's defaults
bw = "scott"
gridsize = 100
cut = 3
clip = [(-np.inf, np.inf), (-np.inf, np.inf)]

# private helpers used internally by seaborn.kdeplot (older seaborn versions)
f = lambda x, y: sd._statsmodels_bivariate_kde(x, y, bw, gridsize, cut, clip)
g = lambda x, y: sd._scipy_bivariate_kde(x, y, bw, gridsize, cut, clip)
If we time those two functions,
# statsmodels
%timeit f(x,y) # 1 loop, best of 3: 16.4 s per loop
# scipy
%timeit g(x,y) # 1 loop, best of 3: 8.67 s per loop
Scipy is hence about twice as fast as statsmodels (the seaborn default). The reason the code in the question takes so long is that instead of a 100 x 100 grid, a 2000 x 2000 grid is used.
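A quick back-of-the-envelope check of how much extra work that is:
# number of KDE evaluations scales with the number of grid points
seaborn_grid = 100 * 100        # 10,000 evaluation points (seaborn default)
question_grid = 2000 * 2000     # 4,000,000 evaluation points (grid in the question)
print(question_grid / seaborn_grid)   # 400.0 -> roughly 400x more KDE evaluations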
Seeing those results, one would actually be tempted to use scipy instead of statsmodels. Unfortunately seaborn does not let you choose which one to use, so one needs to set the respective internal flag manually.
import seaborn as sns
import seaborn.distributions as sd

# force the scipy backend (private flag, may change between seaborn versions)
sd._has_statsmodels = False

# kdeplot now uses scipy.stats.kde.gaussian_kde
sns.kdeplot(x, y)