Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Density scatter plot for huge dataset in matplotlib

I wrote some code a while ago that used gaussian kde to make simple density scatter plots. However, for datasets larger than about 100,000 points, it just ran 'forever' (I killed it after a few days). A friend gave me some code in R that could create such a density plot in seconds (plot_fun.R), and it seems like matplotlib should be able to do the same thing.

I think the right place to look is 2d histograms, but I am struggling to get the density to be 'right'. I modified code I found at this question to accomplish this, but the density is not showing, it looks like only the densist posible points are getting any color.

Here is approximately the code I am using:

# initial data
x = -np.log10(np.random.random_sample(10000))
y = -np.log10(np.random.random_sample(10000))

#histogram definition
bins = [1000, 1000] # number of bins
thresh = 3  #density threshold

#data definition
mn = min(x.min(), y.min())
mx = max(x.max(), y.max())
mn = mn-(mn*.1)
mx = mx+(mx*.1)
xyrange = [[mn, mx], [mn, mx]]

# histogram the data
hh, locx, locy = np.histogram2d(x, y, range=xyrange, bins=bins)
posx = np.digitize(x, locx)
posy = np.digitize(y, locy)

#select points within the histogram
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1])
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are
xdat1 = x[ind][hhsub < thresh] # low density points
ydat1 = y[ind][hhsub < thresh]
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs

f, a = plt.subplots(figsize=(12,12))

c = a.imshow(
    np.flipud(hh.T), cmap='jet',
    extent=np.array(xyrange).flatten(), interpolation='none',
    origin='upper'
)
f.colorbar(c, ax=ax, orientation='vertical', shrink=0.75, pad=0.05)
s = a.scatter(
    xdat1, ydat1, color='darkblue', edgecolor='', label=None,
    picker=True, zorder=2
) 

That produces this plot:

bad plot

The KDE code is here:

f, a = plt.subplots(figsize=(12,12))
xy = np.vstack([x, y])
z = sts.gaussian_kde(xy)(xy)                                           
# Sort the points by density, so that the densest points are       
# plotted last                                                     
idx = z.argsort()                                                  
x2, y2, z = x[idx], y[idx], z[idx]                                 
s = a.scatter(                                                     
    x2, y2, c=z, s=50, cmap='jet',
    edgecolor='', label=None, picker=True, zorder=2                
)   

That produces this plot:

good plot

The problem is, of course, that this code is unusable on large data sets.

My question is: how can I use the 2d histogram to produce a scatter plot like that? ax.hist2d does not produce a useful output, because it colors the whole plot, and all my efforts to get the above 2d histogram data to actually color the dense regions of the plot correctly have failed, I always end up with either no coloring or a tiny percentage of the densest points being colored. Clearly I just don't understand the code very well.

like image 494
Mike Dacre Avatar asked May 03 '26 14:05

Mike Dacre


1 Answers

Your histogram code assigns a unique color (color='darkblue') so what are you expecting? I think you are also over complicating things. This much simpler code works fine:

import numpy as np
import matplotlib.pyplot as plt

x, y = -np.log10(np.random.random_sample((2,10**6)))

#histogram definition
bins = [1000, 1000] # number of bins

# histogram the data
hh, locx, locy = np.histogram2d(x, y, bins=bins)

# Sort the points by density, so that the densest points are plotted last
z = np.array([hh[np.argmax(a<=locx[1:]),np.argmax(b<=locy[1:])] for a,b in zip(x,y)])
idx = z.argsort()
x2, y2, z2 = x[idx], y[idx], z[idx]

plt.figure(1,figsize=(8,8)).clf()
s = plt.scatter(x2, y2, c=z2, cmap='jet', marker='.')  
like image 88
Julien Avatar answered May 05 '26 04:05

Julien



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!