I have 2000 sets of data, each containing a little over 1000 2D points. I'm looking to cluster these sets into anywhere from 20-100 clusters based on similarity. However, I'm having trouble coming up with a reliable method of comparing two sets of data. I've tried a few (rather primitive) approaches and done loads of research, but I can't seem to find anything that fits what I need to do.
I've posted an image below of 3 of my data sets plotted. The data is bounded to 0-1 on the y axis, and lies within roughly the 0-0.10 range on the x axis (in practice, though it could be greater than 0.10 in theory).
The shape and relative proportions of the data are probably the most important things to compare, but the absolute locations of each data set matter as well. In other words, the closer the individual points of one dataset lie to the individual points of another dataset, the more similar the two sets are; on top of that, their absolute positions need to be accounted for.
Green and red should be considered very different, but if push comes to shove, they should be more similar than blue and red.
I have tried a few approaches, but all of them have produced faulty results. The closest answer I could find in my research was "Appropriate similarity metrics for multiple sets of 2D coordinates". However, the answer given there suggests comparing the average distance among nearest neighbours from the centroid, which I don't think will work for me, as direction is just as important as distance for my purposes.
I might add that this will be used to generate data for the input of another program and will only be used sporadically (mainly to generate different sets of data with different numbers of clusters), so somewhat time-consuming algorithms are not out of the question.
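For what it's worth, once a set-to-set distance exists, the clustering step itself is the easy part. Below is a minimal sketch using SciPy's hierarchical clustering on a precomputed distance matrix; point_set_distance is just a placeholder for whatever metric ends up working:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def point_set_distance(a, b):
    # Placeholder: plug in whatever set-to-set metric ends up working.
    raise NotImplementedError

def cluster_point_sets(datasets, n_clusters=50):
    """datasets: list of (N_i, 2) arrays; returns one cluster label per dataset."""
    n = len(datasets)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = point_set_distance(datasets[i], datasets[j])
    # Average-linkage hierarchical clustering on the precomputed distance matrix
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

With 2000 sets that is roughly 2 million pairwise distance evaluations, which is why an only-occasionally-run, somewhat slow metric is acceptable here.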
In two steps
1) First: To tell apart blues.
Compute the mean nearest neighbor distance, up to a cutoff. Select the cutoff to be something like the black distance in the following image:
The blue configurations, as they are more scattered, will give you results much greater than the reds and greens.
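A minimal sketch of that statistic in Python, assuming "up to a cutoff" means capping each nearest-neighbour distance at the cutoff before averaging (the cutoff itself you pick by eye, as above):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(points, cutoff):
    """Mean nearest-neighbour distance of one dataset, capped at `cutoff`.

    points : (N, 2) array of one dataset
    cutoff : a distance like the black one in the image (picked by eye)
    """
    tree = cKDTree(points)
    # k=2: the closest hit is the point itself, the second is its real neighbour
    dists, _ = tree.query(points, k=2)
    return np.minimum(dists[:, 1], cutoff).mean()
```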
2) Second: To tell apart reds and greens
Disregard all points whose nearest neighbor distance is more than something smaller (for example, one fourth of the previous cutoff). Cluster by proximity so as to get clusters of the form:
and
Discard the clusters with fewer than 10 points (or so). For each cluster, run a linear fit and calculate the covariances. The mean covariance for red will be much higher than for green, since the greens are very well aligned at this scale.
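A sketch of this second step in Python. DBSCAN stands in here for the proximity clustering, and I read the "covariance" as the residual scatter around each cluster's fitted line, which is small when a cluster's points are well aligned; the eps, min_samples, and 10-point values are all knobs to tune:

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import DBSCAN

def mean_cluster_scatter(points, eps, min_cluster_size=10):
    """Keep only tightly packed points, cluster them by proximity, fit a line
    to each cluster, and average the residual scatter around those lines.

    eps : the 'something smaller' distance, e.g. 1/4 of the step-1 cutoff
    """
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=2)
    dense = points[dists[:, 1] <= eps]          # drop isolated points
    if len(dense) < min_cluster_size:
        return 0.0

    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(dense)
    scatters = []
    for lab in set(labels) - {-1}:              # -1 is DBSCAN's noise label
        cluster = dense[labels == lab]
        if len(cluster) < min_cluster_size:     # discard small clusters
            continue
        a, b = np.polyfit(cluster[:, 0], cluster[:, 1], 1)   # linear fit y = a*x + b
        resid = cluster[:, 1] - (a * cluster[:, 0] + b)
        scatters.append(resid.var())            # low for well-aligned (green) clusters
    return np.mean(scatters) if scatters else 0.0
```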
There you are.
HTH!