I'd like to cluster a set of multidimensional vectors (n > 10) in which each attribute is an angle. What distance measures and algorithms can I use?
I thought of:
- Manhattan distance
- taking max/min of distances between pairs of attributes (http://www.ncbi.nlm.nih.gov/pubmed/9390236)
- summing angular distances between all pairs of attributes
When it comes to distance measures, Euclidean distance seems very natural and intuitive even for objects in multidimensional space. However, I haven't found an equivalent for angles.
And algorithms:
- affinity propagation
- dbscan
- in general, the scikit-learn algorithms, except for k-means (http://scikit-learn.org/stable/modules/clustering.html#clustering)
Here are some examples:
['179.5', '58.8', '78.2', '211.8', '295.6', '194.9', '9.3', '328.3', '40.9', '323.1', '17.2']
['171.4', '74.9', '81.5', '204.4', '284.1', '193.8', '2.1', '326.7', '49.3', '310.4', '30.5']
['64.2', '119.8', '147.2', '213.0', '167.4', '256.4', '349.4', '28.3', '325.6', '29.6', '348.0']
By the way, these numbers are dihedral angles.
For most common clustering software, the default distance measure is the Euclidean distance, i.e. the square root of the sum of the squared differences. Depending on the type of data and the research question, other dissimilarity measures may be preferred. For example, correlation-based distance is often used in gene expression data analysis; under that measure, the distance between two vectors is 0 when they are perfectly correlated.
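As a quick illustration, here is a minimal sketch using SciPy's correlation distance (defined as 1 minus the Pearson correlation); the toy vectors are made up for demonstration:

```python
import numpy as np
from scipy.spatial.distance import correlation

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # perfectly correlated with a

# correlation() returns 1 - Pearson r, so this prints ~0.0
print(correlation(a, b))
```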
While k-means, the simplest and most prominent clustering algorithm, generally uses Euclidean distance as its dissimilarity measure, it is not a stretch to devise variant clustering algorithms that, among other alterations, use different distance measures.
Although Euclidean distance is very common in clustering, it has a drawback: two data vectors with no attribute values in common may end up with a smaller distance than another pair of data vectors that do share attribute values [31,35,36].
Consider mapping each angle to a point on the unit circle. That way two angles such as -π and π, which look far apart numerically, end up at the same point. This means each vector goes from being n-dimensional to 2n-dimensional.
Then I'd try all the normal distance measures.
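A minimal sketch of that embedding in Python, using the example vectors from the question (names like `X_deg` are my own):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Example vectors from the question (dihedral angles in degrees).
X_deg = np.array([
    [179.5,  58.8,  78.2, 211.8, 295.6, 194.9,   9.3, 328.3,  40.9, 323.1,  17.2],
    [171.4,  74.9,  81.5, 204.4, 284.1, 193.8,   2.1, 326.7,  49.3, 310.4,  30.5],
    [ 64.2, 119.8, 147.2, 213.0, 167.4, 256.4, 349.4,  28.3, 325.6,  29.6, 348.0],
])

theta = np.radians(X_deg)
# Each angle becomes a (cos, sin) point on the unit circle: n columns -> 2n.
X_circle = np.hstack([np.cos(theta), np.sin(theta)])

# Ordinary Euclidean distances on the embedding respect the wrap-around.
D = squareform(pdist(X_circle, metric='euclidean'))
print(D.round(2))
```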
If you plan on using k-means, you really must map the data to Euclidean space, i.e. to sin(angle), cos(angle) for each angle. The reason is that otherwise the mean function does not make sense: the mean of the angles -179° and +179° should be -180° (or +180°), but when done naively, the mean would be 0°, which is the opposite!
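Here is a minimal sketch of that with scikit-learn's KMeans (toy one-angle data and arbitrary parameters of my own); the centroid angles are recovered with atan2, i.e. as circular means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two tight groups of angles (degrees) straddling the wrap-around.
X_deg = np.array([[179.0], [181.0], [178.0], [1.0], [358.0], [3.0]])

theta = np.radians(X_deg)
X_circle = np.hstack([np.cos(theta), np.sin(theta)])  # (cos, sin) embedding

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_circle)

# Recover each centroid's angle from its (cos, sin) coordinates via atan2:
# this is the circular mean, so 179° and 181° average to 180°, not 0°.
n = X_deg.shape[1]
centroid_deg = np.degrees(
    np.arctan2(km.cluster_centers_[:, n:], km.cluster_centers_[:, :n])
) % 360
print(km.labels_)             # cluster assignments
print(centroid_deg.round(1))  # one centroid near 180, the other near 0/360
```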
If you give other algorithms a try, such as HAC, PAM, CLARA, DBSCAN, OPTICS, etc., then you can define a custom distance function that handles the 360° wrap-around. For example, you could use min(abs(x-y), 360-abs(x-y)) per attribute and then compute the sum of these, or the sum of squares.
But this approach does not work with k-means!
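For example, here is a sketch of that wrap-around distance plugged into DBSCAN through a precomputed distance matrix; the eps value is an arbitrary illustrative choice, and the data are the question's example vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

def angular_distance(u, v):
    """Sum of per-attribute differences with 360-degree wrap-around."""
    d = np.abs(u - v)
    return np.sum(np.minimum(d, 360.0 - d))

X_deg = np.array([
    [179.5,  58.8,  78.2, 211.8, 295.6, 194.9,   9.3, 328.3,  40.9, 323.1,  17.2],
    [171.4,  74.9,  81.5, 204.4, 284.1, 193.8,   2.1, 326.7,  49.3, 310.4,  30.5],
    [ 64.2, 119.8, 147.2, 213.0, 167.4, 256.4, 349.4,  28.3, 325.6,  29.6, 348.0],
])

# Pairwise distances with the custom metric, then cluster on the matrix.
D = squareform(pdist(X_deg, metric=angular_distance))
labels = DBSCAN(eps=100.0, min_samples=2, metric='precomputed').fit_predict(D)
print(labels)  # the two similar vectors cluster; the third is noise (-1)
```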