Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scatter plot with a huge amount of data

I would like to use Matplotlib to generate a scatter plot with a huge amount of data (about 3 million points). Actually I've 3 vectors with the same dimension and I use to plot in the following way.

import matplotlib.pyplot as plt
import numpy as np
from numpy import *
from matplotlib import rc
import pylab
from pylab import * 
fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)
plt.scatter(delta,vf,c=dS,alpha=0.7,cmap=cm.Paired)

Nothing special actually. But it takes too long to generate it actually (I'm working on my MacBook Pro 4 GB RAM with Python 2.7 and Matplotlib 1.0). Is there any way to improve the speed?

like image 350
Nicola Vianello Avatar asked Nov 02 '10 21:11

Nicola Vianello


People also ask

What is scatter plot in big data?

Scatter plots are the graphs that present the relationship between two variables in a data-set. It represents data points on a two-dimensional plane or on a Cartesian system. The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis.

Can a scatter plot have more than 2 variables?

You can have more than one dependent variable. Your data set might include more than one dependent variable, and you can still track this on a scatter plot. The only thing you'll want to change is the color of each dependent variable so that you can measure them against each other on the scatter plot.


2 Answers

Unless your graphic is huge, many of those 3 million points are going to overlap. (A 400x600 image only has 240K dots...)

So the easiest thing to do would be to take a sample of say, 1000 points, from your data:

import random
delta_sample=random.sample(delta,1000)

and just plot that.

For example:

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import random

fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)

N=3*10**6
delta=np.random.normal(size=N)
vf=np.random.normal(size=N)
dS=np.random.normal(size=N)

idx=random.sample(range(N),1000)

plt.scatter(delta[idx],vf[idx],c=dS[idx],alpha=0.7,cmap=cm.Paired)
plt.show()

alt text

Or, if you need to pay more attention to outliers, then perhaps you could bin your data using np.histogram, and then compose a delta_sample which has representatives from each bin.

Unfortunately, when using np.histogram I don't think there is any easy way to associate bins with individual data points. A simple, but approximate solution is to use the location of a point in or on the bin edge itself as a proxy for the points in it:

xedges=np.linspace(-10,10,100)
yedges=np.linspace(-10,10,100)
zedges=np.linspace(-10,10,10)
hist,edges=np.histogramdd((delta,vf,dS), (xedges,yedges,zedges))
xidx,yidx,zidx=np.where(hist>0)
plt.scatter(xedges[xidx],yedges[yidx],c=zedges[zidx],alpha=0.7,cmap=cm.Paired)
plt.show()

alt text

like image 192
unutbu Avatar answered Oct 12 '22 12:10

unutbu


What about trying pyplot.hexbin? It generates a sort of heatmap based on point density in a set number of bins.

like image 22
conjectures Avatar answered Oct 12 '22 11:10

conjectures