
KDE is very slow with large data

When I try to make a scatter plot colored by density, it takes forever, probably because the data set is quite large.

This is basically how I do it:

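import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)  # density estimate evaluated at every data point
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='none')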

For reference, the data sizes are:

>>> len(x_values)
809649

>>> len(y_values)
809649

Is there any other way to get the same result, but faster?

codeKiller asked Jan 27 '15 15:01



1 Answer

No, there are no good solutions.

Every point has to be processed and drawn as a circle, and most of those circles will probably end up hidden behind other points.

My tricks (note that these may slightly change the output):

  • Get the minimum and maximum of the data and set the size of the image from them up front, so that the figure does not need to be redone.

  • Remove as much data as possible:

    • Drop duplicate points.

    • Round the values to a chosen precision (e.g. of the floats) and drop the duplicates that result. You can derive the precision from half the size of the dot (or from the resolution of the graph, if you want to keep the original look).

    Less data means more speed. Removing a point is far quicker than drawing it in a graph, where it will be overwritten anyway.

  • For huge data sets a heatmap is often more interesting, because it gives more information. But in your case, I think you still have too much data.
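The data-removal trick can be sketched roughly as follows. This is only an illustration, not a tuned implementation: the function name thin_points and the precision argument are made up for this example, and the right precision depends on your dot size and figure resolution.

```python
import numpy as np

def thin_points(x, y, precision):
    """Snap points to a grid of spacing `precision` and drop duplicates.

    `thin_points` and `precision` are hypothetical names for this sketch.
    Returns the surviving x and y coordinates plus, for each survivor,
    the number of original points that were merged into it.
    """
    xy = np.column_stack([np.asarray(x, dtype=float),
                          np.asarray(y, dtype=float)])
    # Snap every coordinate to the nearest multiple of `precision`.
    # Adding 0.0 turns any -0.0 into 0.0 so both compare as one value.
    snapped = np.round(xy / precision) * precision + 0.0
    # Keep one row per distinct snapped position and count the merged points.
    unique_xy, counts = np.unique(snapped, axis=0, return_counts=True)
    return unique_xy[:, 0], unique_xy[:, 1], counts
```

The returned counts can be used directly as a crude color value (c=counts in plt.scatter), which is a point count per cell rather than a Gaussian KDE. If you still want gaussian_kde, running it on the thinned points is much cheaper than on all of them; recent SciPy versions also accept a weights argument, so the counts could be passed there, but I have not measured how close that gets to the original picture.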

Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). That page also uses my first point.
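If a heatmap is acceptable instead of individual dots, a binned version is very cheap even for hundreds of thousands of points. The sketch below swaps the KDE for a plain 2D histogram (np.histogram2d), so it is an approximation of the density coloring, not the same result. The function name density_grid and the default of 200 bins are assumptions chosen for the example.

```python
import numpy as np

def density_grid(x, y, bins=200):
    """Bin the points into a `bins` x `bins` grid of counts.

    `density_grid` is a hypothetical helper for this sketch. It returns the
    grid (rows correspond to y) and the fixed extent
    [xmin, xmax, ymin, ymax] taken from the data range.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Compute the plot extent once from the minimum and maximum of the data.
    extent = [float(x.min()), float(x.max()), float(y.min()), float(y.max())]
    counts, _, _ = np.histogram2d(
        x, y, bins=bins,
        range=[[extent[0], extent[1]], [extent[2], extent[3]]],
    )
    # histogram2d puts x along axis 0; transpose so rows run along y.
    return counts.T, extent
```

The grid can then be shown with plt.imshow(grid, origin='lower', extent=extent, aspect='auto'), so only one image is drawn instead of one circle per point.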

Giacomo Catenazzi answered Oct 09 '22 00:10
