Why is it scipy.stats.gaussian_kde() slower than seaborn.kde_plot() for the same data?

Tags:

In python 3.7, I have this numpy array with shape=(2, 34900). This arrays is a list of coordinates where the index 0 represents the X axis and the index 1 represents the y axis.

When I use seaborn.kde_plot() to make a visualization of the distribution of this data, I'm able to get the result in about 5-15 seconds when running on a i5 7th generation.

But when I try to run the following piece of code:

#Find the kernel for 
k = scipy.stats.kde.gaussian_kde(data, bw_method=.3)
#Define the grid
xi, yi = np.mgrid[0:1:2000*1j, 0:1:2000*1j]
#apply the function
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

which finds the gaussian kernel for this data and applies it to a grid I defined, it takes much more time. I wasn't able to run the full array but when running on a slice with the size of 140, it takes about 40 seconds to complete.

The 140 sized slice does make an interesting result which I was able to visualize using plt.pcolormesh().

My question is what I am missing here. If I understood what is happening correctly, I'm using the scipy.stats.kde.gaussian_kde() to create an estimation of a function defined by the data. Then I'm applying the function to a 2D space and getting it's Z component as result. Then I'm plotting the Z component. But how can this process be any different from seaborn.kde_plot() that makes the code take so much longer.

Scipy's implementation just goes through each point doing this:

for i in range(self.n):
    diff = self.dataset[:, i, newaxis] - points
    tdiff = dot(self.inv_cov, diff)
    energy = sum(diff*tdiff,axis=0) / 2.0
    result = result + exp(-energy)

609

asked Mar 13 '18 21:03

Victor

1 Answers

Seaborn has in general two ways to calculate the bivariate kde. If available, it uses statsmodels, if not, it falls back to scipy.

The scipy code is similar to what is shown in the question. It uses scipy.stats.gaussian_kde. The statsmodels code uses statsmodels.nonparametric.api.KDEMultivariate.

However, for a fair comparisson we would need to take the same grid size for both methods. The standard gridsize for seaborn is 100 points.

import numpy as np; np.random.seed(42)
import seaborn.distributions as sd

N = 34900
x = np.random.randn(N)
y = np.random.randn(N)
bw="scott"
gridsize=100
cut=3
clip = [(-np.inf, np.inf), (-np.inf, np.inf)]

f = lambda x,y : sd._statsmodels_bivariate_kde(x, y, bw, gridsize, cut, clip)
g = lambda x,y : sd._scipy_bivariate_kde(x, y, bw, gridsize, cut, clip)

If we time those two functions,

# statsmodels
%timeit f(x,y)  # 1 loop, best of 3: 16.4 s per loop
# scipy
%timeit g(x,y)  # 1 loop, best of 3: 8.67 s per loop

Scipy is hence twice as fast as statsmodels (the seaborn default). The reason why the code in the question takes so long is that instead of a grid of size 100, a grid of size 2000 is used.

Seeing those results one would actually be tempted to use scipy instead of statsmodels. Unfortunately it does not allow to choose which one to use. One hence needs to manually set the respective flag.

import seaborn.distributions as sd
sd._has_statsmodels = False
# plot kdeplot with scipy.stats.kde.gaussian_kde
sns.kdeplot(x,y)

165

answered Nov 15 '22 04:11

ImportanceOfBeingErnest

Related questions
                            
                                Numpy percentiles with linear interpolation - wrong value?
                            
                                Intermittent "getrandom() initialization failed" using scrapy spider
                            
                                is there a magic method for sorted() in Python?
                            
                                Remove non-ASCII characters from string columns in pandas
                            
                                "set_UVC" equivilent for a 3D quiver plot in matplotlib
                            
                                Creating a hotkey to enter text using python, running in background waiting for key-press
                            
                                Python Hadoop streaming on windows, Script not a valid Win32 application
                            
                                Pandas - add NaN for missing values when pd.merge
                            
                                What does Selenium .set_script_timeout(n) do and how is it different from driver.set_page_load_timeout(n)?
                            
                                Iteration over columns and rows in Pandas Dataframe
                            
                                Boto3 AWS API error responses for SSM
                            
                                ResultSet object has no attribute 'find_all'
                            
                                Tinting an image in Pygame
                            
                                How to append list of numerous types to single string (python)
                            
                                Is pandas showing the wrong percentile?
                            
                                What does RepeatedKFold actually mean?
                            
                                Regular expression must contain and may only contain
                            
                                Pythonic way to apply format to all strings in dictionary without f-strings
                            
                                How to support %x formatting on a class that emulates int
                            
                                Creating a column based on multiple conditions

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is it scipy.stats.gaussian_kde() slower than seaborn.kde_plot() for the same data?

Tags:

python

scipy

seaborn

distribution

gaussian

Victor

People also ask

1 Answers

ImportanceOfBeingErnest

Recent Activity

Donate For Us