KDE is very slow with large data

Question

When I try to make a scatter plot, colored by density, it takes forever.

This is probably because the data set is quite large.

This is basically how I do it:

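import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)  # density estimate at every data point
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='none')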

For additional context:

>>> len(x_values)
809649

>>> len(y_values)
809649

Is there any other way to get the same result, but faster?

Giacomo Catenazzi · Accepted Answer

No, there are no good solutions.

Every point has to be processed and drawn as a circle, which will probably be hidden by other points anyway.

My tricks (note that these may change the output slightly):

- Get the minimum and maximum of the data and set the image to that size, so that the figure does not need to be redone.
- Remove as much data as possible:
  - drop duplicate points;
  - round to a chosen precision (e.g. of the floats) and drop the duplicates again. You can derive the precision from half the size of the dot (or from the resolution of the graph, if you want to keep the original look).
  Less data means more speed. Removing a point is far quicker than drawing it in a graph, where it will be overwritten.
- Heatmaps are often more interesting for huge data sets, because they give more information. But in your case, I think you still have too much data.
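The "round and remove duplicates" trick could look roughly like the sketch below. The helper name thin_points, the precision value, and the random sample data are all just assumptions for illustration; pick the precision from your dot size or figure resolution as described above.

```python
import numpy as np

def thin_points(x, y, precision=2):
    """Round coordinates to `precision` decimals and drop duplicate points.

    Returns the unique rounded x, the unique rounded y, and the number of
    original points that collapsed onto each unique point.
    (Hypothetical helper, not part of NumPy or SciPy.)
    """
    xy = np.column_stack([np.asarray(x, dtype=float), np.asarray(y, dtype=float)])
    # Adding 0.0 turns any -0.0 produced by rounding into 0.0.
    rounded = np.round(xy, precision) + 0.0
    unique_xy, counts = np.unique(rounded, axis=0, return_counts=True)
    return unique_xy[:, 0], unique_xy[:, 1], counts

# Made-up sample data standing in for x_values / y_values.
rng = np.random.default_rng(0)
x_values = rng.normal(size=100_000)
y_values = rng.normal(size=100_000)

ux, uy, counts = thin_points(x_values, y_values, precision=1)
print(len(x_values), "points reduced to", len(ux))
```

The counts array records how many original points landed on each rounded position. If you still want a density estimate on the thinned data, gaussian_kde takes a weights argument in recent SciPy versions, so passing counts there may approximate the original result; treat that as something to check against your own data rather than a guarantee.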

Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). That page also uses my first point.
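For reference, here is a minimal sketch in the spirit of that documentation example: the density is evaluated on a fixed grid spanning the minimum and maximum of the data, instead of at every data point. The 100x100 grid size and the random sample data are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up sample data standing in for x_values / y_values (2000 points).
rng = np.random.default_rng(0)
x_values = rng.normal(size=2000)
y_values = rng.normal(scale=0.5, size=2000)

# First point: take the extent of the image from the data.
xmin, xmax = x_values.min(), x_values.max()
ymin, ymax = y_values.min(), y_values.max()

# Regular 100x100 grid covering that extent.
grid_x, grid_y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([grid_x.ravel(), grid_y.ravel()])

# Fit the KDE on the data, but evaluate it only at the grid positions.
kernel = gaussian_kde(np.vstack([x_values, y_values]))
density = kernel(positions).reshape(grid_x.shape)
```

The resulting density array can be shown as an image, for example with plt.imshow(np.rot90(density), extent=[xmin, xmax, ymin, ymax]). The number of evaluation points is then fixed by the grid rather than by the size of the data set.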
