
Visualization of scatter plots with overlapping points in matplotlib

I have to represent about 30,000 points in a scatter plot in matplotlib. These points belong to two different classes, so I want to depict them with different colors.

I succeeded in doing so, but there is an issue. The points overlap in many regions, and the class that I plot last is drawn on top of the other one, hiding it. Furthermore, a scatter plot cannot show how many points lie in each region. I have also tried making a 2D histogram with histogram2d and imshow, but it is difficult to show the points belonging to both classes clearly.

Can you suggest a way to make clear both the distribution of the classes and the concentration of the points?

EDIT: To be more clear, this is the link to my data file in the format "x,y,class"

papafe asked Sep 28 '13 08:09




1 Answer

One approach is to plot the data as a scatter plot with a low alpha, so you can see the individual points as well as a rough measure of density. (The downside to this is that the approach has a limited range of overlap it can show -- i.e., a maximum density of about 1/alpha.)
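The 1/alpha limit can be made concrete with a rough calculation. This is a minimal sketch, assuming overlapping markers are blended with standard "over" alpha compositing; the helper name `stacked_opacity` is hypothetical and only used for illustration.

```python
# Opacity after stacking n markers of opacity alpha, assuming standard
# "over" compositing: each new layer covers a fraction alpha of whatever
# is still visible beneath it.
def stacked_opacity(alpha, n):
    return 1.0 - (1.0 - alpha) ** n


alpha = 0.03
for n in (1, 10, 33, 100, 300):
    print(f"{n:4d} overlapping points -> opacity {stacked_opacity(alpha, n):.2f}")
```

Under this assumption, around 1/alpha overlapping points (about 33 here) the marker color is already roughly two-thirds saturated, and a few multiples of that are visually indistinguishable from fully opaque.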

Here's an example:


As you can imagine, because of the limited range of overlaps that can be expressed, there's a tradeoff between visibility of the individual points and the expression of amount of overlap (and the size of the marker, plot, etc).

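```python
import numpy as np
import matplotlib.pyplot as plt

N = 10000
mean = [0, 0]
cov = [[2, 1], [1, 2]]  # symmetric, positive-definite covariance matrix
x, y = np.random.multivariate_normal(mean, cov, N).T

# Large, nearly transparent markers: darker regions indicate more overlap.
plt.scatter(x, y, s=70, alpha=0.03)
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()
```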
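The same idea can be extended to the two-class case from the question. The sketch below is not part of the original answer: it uses synthetic data as a stand-in for the "x,y,class" file, and it adds one extra step as an assumption about what helps, namely shuffling the points so that the two classes are interleaved in draw order and neither is always painted on top. The names `xy`, `labels`, and `palette` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba_array

rng = np.random.default_rng(0)

# Synthetic stand-in for the "x,y,class" data: two overlapping clouds.
n_per_class = 15000
a = rng.multivariate_normal([-1, 0], [[2, 1], [1, 2]], n_per_class)
b = rng.multivariate_normal([1, 0], [[2, -1], [-1, 2]], n_per_class)

xy = np.vstack([a, b])
labels = np.repeat([0, 1], n_per_class)

# Shuffle so the classes are interleaved in draw order; otherwise the
# class plotted last would sit on top of the other everywhere.
order = rng.permutation(len(xy))
xy, labels = xy[order], labels[order]

# One RGBA row per point, looked up from a two-color palette.
palette = to_rgba_array(["tab:blue", "tab:red"])
colors = palette[labels]

fig, ax = plt.subplots()
ax.scatter(xy[:, 0], xy[:, 1], s=20, c=colors, alpha=0.05, linewidths=0)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)

if __name__ == "__main__":
    plt.show()
```

Marker size and alpha would need tuning for real data; the values here are only a starting point for roughly 30,000 points.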

(I'm assuming here you meant 30e3 points, not 30e6. For 30e6, I think some type of averaged density plot would be necessary.)
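One possible form of such a density plot, sketched here under several assumptions: the counts per class are binned with `np.histogram2d` on shared edges, log-scaled, and blended into a single RGB image so that one class darkens toward blue, the other toward red, and overlap turns purple. The data is synthetic and much smaller than 30e6 to keep the example quick, and the color blending scheme is an illustrative choice rather than anything the answer prescribes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic stand-in data (hypothetical): two overlapping classes.
n = 200_000
x0, y0 = rng.multivariate_normal([-1, 0], [[2, 1], [1, 2]], n).T
x1, y1 = rng.multivariate_normal([1, 0], [[2, -1], [-1, 2]], n).T

# Shared bin edges so both histograms align cell by cell.
edges = np.linspace(-5, 5, 201)
h0, _, _ = np.histogram2d(x0, y0, bins=[edges, edges])
h1, _, _ = np.histogram2d(x1, y1, bins=[edges, edges])

# Log scaling compresses the dynamic range; both classes share one scale.
peak = np.log1p(max(h0.max(), h1.max()))
d0 = np.log1p(h0) / peak
d1 = np.log1p(h1) / peak

# White background; class 0 darkens toward blue, class 1 toward red.
rgb = np.ones(h0.shape + (3,))
rgb[..., 0] -= d0
rgb[..., 1] -= d0 + d1
rgb[..., 2] -= d1
rgb = np.clip(rgb, 0.0, 1.0)

fig, ax = plt.subplots()
# histogram2d indexes as [x, y]; imshow expects [row=y, col=x].
ax.imshow(rgb.transpose(1, 0, 2), origin="lower", extent=[-5, 5, -5, 5])

if __name__ == "__main__":
    plt.show()
```

Because the binning cost grows only linearly with the number of points and the image size is fixed by the bin count, this kind of plot stays readable at sizes where individual markers would saturate.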

tom10 answered Sep 21 '22 00:09