I have 2000 sets of data, each containing a little over 1000 2D points. I'm looking to cluster these sets into anywhere from 20-100 clusters based on similarity. However, I'm having trouble coming up with a reliable method of comparing two sets of data. I've tried a few (rather primitive) approaches and done loads of research, but I can't seem to find anything that fits what I need to do.
I've posted an image below of 3 of my data sets plotted. The data is bounded to 0-1 on the y axis, and lies within roughly the 0-0.10 range on the x axis (in practice, though it could be greater than 0.10 in theory).
The shape and relative proportions of the data are probably the most important things to compare, but the absolute locations of each data set matter as well. In other words, the closer the individual points of one dataset lie to the individual points of another dataset, the more similar the two sets are; on top of that, their absolute positions need to be accounted for.
Green and red should be considered very different, but if push comes to shove, they should be more similar than blue and red.
I have tried a few approaches, but all of them have produced faulty results. The closest answer I could find in my research was "Appropriate similarity metrics for multiple sets of 2D coordinates". However, the answer given there suggests comparing the average distance among nearest neighbours from the centroid, which I don't think will work for me, as direction is just as important as distance for my purposes.
I might add that this will be used to generate data for the input of another program and will only be used sporadically (mainly to generate different sets of data with different numbers of clusters), so somewhat time-consuming algorithms are not out of the question.
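For what it's worth, once a set-to-set distance exists, the clustering step itself is the easy part. Below is a minimal sketch using SciPy's hierarchical clustering on a precomputed distance matrix; point_set_distance is just a placeholder for whatever metric ends up working:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def point_set_distance(a, b):
    # Placeholder: plug in whatever set-to-set metric ends up working.
    raise NotImplementedError

def cluster_point_sets(datasets, n_clusters=50):
    """datasets: list of (N_i, 2) arrays; returns one cluster label per dataset."""
    n = len(datasets)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = point_set_distance(datasets[i], datasets[j])
    # Average-linkage hierarchical clustering on the precomputed distance matrix
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

With 2000 sets that is roughly 2 million pairwise distance evaluations, which is why an only-occasionally-run, somewhat slow metric is acceptable here.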
In two steps
1) First: To tell apart blues.
Compute the mean nearest neighbor distance, up to a cutoff. Select the cutoff to be something like the black distance in the following image:
The blue configurations, as they are more scattered, will give you results much greater than the reds and greens.
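A minimal sketch of that statistic in Python, assuming "up to a cutoff" means capping each nearest-neighbour distance at the cutoff before averaging (the cutoff itself you pick by eye, as above):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(points, cutoff):
    """Mean nearest-neighbour distance of one dataset, capped at `cutoff`.

    points : (N, 2) array of one dataset
    cutoff : a distance like the black one in the image (picked by eye)
    """
    tree = cKDTree(points)
    # k=2: the closest hit is the point itself, the second is its real neighbour
    dists, _ = tree.query(points, k=2)
    return np.minimum(dists[:, 1], cutoff).mean()
```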
2) Second: To tell apart reds and greens
Disregard all points whose nearest neighbor distance is more than something smaller (for example, one fourth of the previous cutoff). Cluster by proximity so as to get clusters of the form:
and
Discard the clusters with fewer than 10 points (or so). For each cluster, run a linear fit and calculate the covariances. The mean covariance for red will be much higher than for green, since the greens are very well aligned at this scale.
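A sketch of this second step in Python. DBSCAN stands in here for the proximity clustering, and I read the "covariance" as the residual scatter around each cluster's fitted line, which is small when a cluster's points are well aligned; the eps, min_samples, and 10-point values are all knobs to tune:

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import DBSCAN

def mean_cluster_scatter(points, eps, min_cluster_size=10):
    """Keep only tightly packed points, cluster them by proximity, fit a line
    to each cluster, and average the residual scatter around those lines.

    eps : the 'something smaller' distance, e.g. 1/4 of the step-1 cutoff
    """
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=2)
    dense = points[dists[:, 1] <= eps]          # drop isolated points
    if len(dense) < min_cluster_size:
        return 0.0

    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(dense)
    scatters = []
    for lab in set(labels) - {-1}:              # -1 is DBSCAN's noise label
        cluster = dense[labels == lab]
        if len(cluster) < min_cluster_size:     # discard small clusters
            continue
        a, b = np.polyfit(cluster[:, 0], cluster[:, 1], 1)   # linear fit y = a*x + b
        resid = cluster[:, 1] - (a * cluster[:, 0] + b)
        scatters.append(resid.var())            # low for well-aligned (green) clusters
    return np.mean(scatters) if scatters else 0.0
```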
There you are.
HTH!