
Comparing sets of 2D data/scatterplots

I have 2000 sets of data, each containing a little over 1000 2D points. I'm looking to cluster these sets into anywhere from 20 to 100 clusters based on similarity. However, I'm having trouble coming up with a reliable method of comparing the sets. I've tried a few (rather primitive) approaches and done loads of research, but I can't seem to find anything that fits what I need to do.

I've posted an image below of 3 of my data sets plotted. The data is bounded to 0-1 on the y axis, and falls within the ~0-0.10 range on the x axis (in practice; in theory it could be greater than 0.10).

The shape and relative proportions of the data are probably the most important things to compare. However, the absolute location of each data set matters as well. In other words, the closer the relative positions of one data set's points are to those of another data set's points, the more similar the two sets are; their absolute positions would then need to be accounted for as well.

Green and red should be considered very different but, if push comes to shove, they should be more similar to each other than blue and red are.

http://img153.imageshack.us/img153/6730/screenshot20110204at004.png

I have tried to:

  • compare based on overall averages and deviation
  • split the variables into coordinate regions (i.e. (0-0.10, 0-0.10), (0.10-0.20, 0.10-0.20)...(0.9-1.0, 0.9-1.0)) and compare similarity based on shared points within regions
  • measure the average Euclidean distance to nearest neighbours among the data sets

All of these have produced faulty results. The closest answer I could find in my research was "Appropriate similarity metrics for multiple sets of 2D coordinates". However, the answer given there suggests comparing the average distance among nearest neighbours from the centroid, which I don't think will work for me, as the direction is as important as the distance for my purposes.

I might add that this will be used to generate input data for another program and will only be run sporadically (mainly to generate different sets of data with different numbers of clusters), so moderately time-consuming algorithms are not out of the question.

asked Feb 05 '11 16:02 by mcnulty



1 Answer

In two steps

1) First: telling the blues apart.

Compute the mean nearest-neighbor distance, up to a cutoff. Choose the cutoff to be something like the black distance in the following image:

[image]

The blue configurations, being more scattered, will give you results much greater than the reds and greens.
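As a rough illustration of this first step, here is a minimal pure-Python sketch. It assumes that "up to a cutoff" means capping each nearest-neighbor distance at the cutoff (one possible reading, not confirmed above). The function names, the toy data and the cutoff value of 0.2 are all made up for the example.

```python
import math

def nearest_neighbor_distances(points):
    """For each point, the Euclidean distance to its closest other point."""
    dists = []
    for i, (xi, yi) in enumerate(points):
        best = math.inf
        for j, (xj, yj) in enumerate(points):
            if i != j:
                best = min(best, math.hypot(xi - xj, yi - yj))
        dists.append(best)
    return dists

def mean_nn_distance(points, cutoff):
    """Mean nearest-neighbor distance, each distance capped at `cutoff`.

    Capping is an assumption about what "up to a cutoff" means.
    """
    capped = [min(d, cutoff) for d in nearest_neighbor_distances(points)]
    return sum(capped) / len(capped)

# Hypothetical toy data: a tight, aligned set and a scattered one.
tight = [(0.010 * k, 0.5 + 0.010 * k) for k in range(10)]
scattered = [(0.01, 0.05), (0.09, 0.95), (0.02, 0.60), (0.08, 0.30), (0.05, 0.80)]

print(mean_nn_distance(tight, cutoff=0.2))
print(mean_nn_distance(scattered, cutoff=0.2))
```

The double loop is quadratic in the number of points, which should be tolerable for about 1000 points per set; a spatial index such as `scipy.spatial.cKDTree` would likely be faster if SciPy is available.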

2) Second: telling the reds and greens apart.

Disregard all points whose nearest-neighbor distance is larger than some smaller threshold (for example, one fourth of the previous cutoff). Cluster the remaining points by proximity, so as to get clusters of the form:

[image] and [image]

Discard the clusters with fewer than 10 points (or so). For each remaining cluster, run a linear fit and calculate the covariances. The mean covariance for red will be much higher than for green, since the greens are closely aligned at this scale.
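The second step could be sketched along these lines, again in plain Python. Several things here are assumptions rather than part of the recipe above: proximity clustering is done as simple distance-linked connected components, the linear fit is an orthogonal (total least squares) fit, and "covariance" is read as the variance perpendicular to the fitted line, i.e. the smaller eigenvalue of each cluster's 2x2 covariance matrix. Points whose nearest neighbor is farther than `link_dist` end up as singleton clusters and are dropped by the size filter, which covers the "disregard" step. All names and toy values are hypothetical.

```python
import math

def cluster_by_proximity(points, link_dist):
    """Connected components: two points share a cluster if a chain of
    hops, each no longer than `link_dist`, joins them."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= link_dist]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
            members.extend(near)
        clusters.append([points[k] for k in members])
    return clusters

def line_scatter(cluster):
    """Variance perpendicular to the orthogonal least-squares line:
    the smaller eigenvalue of the cluster's 2x2 covariance matrix."""
    n = len(cluster)
    mx = sum(x for x, _ in cluster) / n
    my = sum(y for _, y in cluster) / n
    sxx = sum((x - mx) ** 2 for x, _ in cluster) / n
    syy = sum((y - my) ** 2 for _, y in cluster) / n
    sxy = sum((x - mx) * (y - my) for x, y in cluster) / n
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, (sxx + syy - math.hypot(sxx - syy, 2 * sxy)) / 2)

def mean_cluster_scatter(points, link_dist, min_size=10):
    """Average line_scatter over clusters with at least `min_size` points.
    Returns None when no cluster is large enough."""
    big = [c for c in cluster_by_proximity(points, link_dist)
           if len(c) >= min_size]
    if not big:
        return None
    return sum(line_scatter(c) for c in big) / len(big)

# Hypothetical toy data: one well-aligned cluster and one blob-like cluster.
aligned = [(0.001 * k, 0.5 + 0.01 * k) for k in range(12)]
blob = [(0.01 * i, 0.5 + 0.01 * j) for i in range(4) for j in range(3)]

print(mean_cluster_scatter(aligned, link_dist=0.02))
print(mean_cluster_scatter(blob, link_dist=0.02))
```

An orthogonal fit is used here instead of an ordinary y-on-x regression because the data spans a much narrower range in x than in y, so near-vertical clusters are plausible; with a different reading of "covariance" the per-cluster statistic would need to change.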

There you are.

HTH!

answered Oct 01 '22 23:10 by Dr. belisarius