 

How to cluster multivariate angular data? Distance measures and algorithms

I'd like to cluster a set of multidimensional vectors (n > 10) in which each attribute is an angle. What distance measures and algorithms can I use?

I thought of:
- Manhattan distance
- taking the max/min of distances between pairs of attributes (http://www.ncbi.nlm.nih.gov/pubmed/9390236)
- summing angular distances between all pairs of attributes

When it comes to distance measures, Euclidean distance seems natural and intuitive even for objects in multidimensional space. However, I haven't found an equivalent for angles.

And algorithms:
- affinity propagation
- DBSCAN
- in general, the scikit-learn algorithms, except for k-means (http://scikit-learn.org/stable/modules/clustering.html#clustering)

Here are some examples:

['179.5', '58.8', '78.2', '211.8', '295.6', '194.9', '9.3', '328.3', '40.9', '323.1', '17.2']
['171.4', '74.9', '81.5', '204.4', '284.1', '193.8', '2.1', '326.7', '49.3', '310.4', '30.5']
['64.2', '119.8', '147.2', '213.0', '167.4', '256.4', '349.4', '28.3', '325.6', '29.6', '348.0']
By the way, these numbers are dihedral angles.
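For instance, the third option above (summing per-attribute angular distances) could be sketched like this in Python; `angular_distance` is a hypothetical helper name, and the angles are assumed to be in degrees with a 360° wrap-around:

```python
import numpy as np

def angular_distance(u, v):
    # Per-attribute absolute difference, then take the shorter way
    # around the circle (e.g. 350 and 10 are 20 degrees apart, not 340).
    d = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    return float(np.sum(np.minimum(d, 360.0 - d)))

a = [179.5, 58.8, 78.2, 211.8, 295.6, 194.9, 9.3, 328.3, 40.9, 323.1, 17.2]
b = [171.4, 74.9, 81.5, 204.4, 284.1, 193.8, 2.1, 326.7, 49.3, 310.4, 30.5]

print(angular_distance(a, b))  # ~90.7
```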

asked Aug 26 '14 by cafe_



2 Answers

Consider mapping each angle to a point on the unit circle, i.e. θ → (cos θ, sin θ). That way, two angles such as -π and π, which look far apart numerically, end up at the same point. Each vector then goes from being n-dimensional to 2n-dimensional.

Then, I'd try all the normal distance measurements.
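A minimal sketch of this embedding in Python (`to_unit_circle` is a hypothetical helper name; angles are assumed to be in degrees):

```python
import numpy as np

def to_unit_circle(angles_deg):
    # Map each angle to (cos, sin) on the unit circle: n dims -> 2n dims.
    rad = np.radians(np.asarray(angles_deg, dtype=float))
    return np.concatenate([np.cos(rad), np.sin(rad)])

a = to_unit_circle([179.0, 10.0])
b = to_unit_circle([-179.0, 10.0])  # -179 and 179 are only 2 degrees apart

print(np.linalg.norm(a - b))  # small Euclidean distance, as desired
```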

answered Sep 19 '22 by Unapiedra


If you plan on using k-means, you really must map the data to Euclidean space, i.e. to (sin(angle), cos(angle)) for each angle. The reason is that otherwise the mean function does not make sense: the mean of the angles -179° and +179° should be -180° (or +180°), but computed naively, the mean is 0°, which points in the opposite direction!
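To illustrate why the naive mean fails, here is a small sketch comparing it with the circular mean, computed via the sin/cos mapping and atan2:

```python
import math

angles = [-179.0, 179.0]

# Naive arithmetic mean: 0 degrees -- the opposite direction!
naive_mean = sum(angles) / len(angles)

# Circular mean: average the unit vectors, then take the angle back.
s = sum(math.sin(math.radians(a)) for a in angles) / len(angles)
c = sum(math.cos(math.radians(a)) for a in angles) / len(angles)
circ_mean = math.degrees(math.atan2(s, c))

print(naive_mean)   # 0.0
print(circ_mean)    # close to +/-180, the correct answer
```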

If you give other algorithms a try, such as HAC, PAM, CLARA, DBSCAN, OPTICS etc. then you can define a custom distance function, which handles the 360° wrap-around. For example, you could use

min(abs(x-y), 360-abs(x-y))

and then compute the sum of these, or the sum of squares.

But this approach does not work with k-means!
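As a sketch of this approach, assuming scikit-learn is available: the wrap-around distance can be plugged into DBSCAN via a precomputed distance matrix. The `eps=40.0` threshold and the three truncated example rows are arbitrary, for illustration only:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def wraparound_metric(u, v):
    # Per-attribute angular difference with 360-degree wrap-around,
    # summed over attributes.
    d = np.abs(u - v)
    return np.sum(np.minimum(d, 360.0 - d))

X = np.array([
    [179.5, 58.8, 78.2],
    [171.4, 74.9, 81.5],
    [64.2, 119.8, 147.2],
])

# Precompute the pairwise distance matrix with the custom metric,
# then hand it to DBSCAN as-is.
D = pairwise_distances(X, metric=wraparound_metric)
labels = DBSCAN(eps=40.0, min_samples=1, metric="precomputed").fit_predict(D)
print(labels)  # first two rows cluster together, the third stands apart
```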

answered Sep 20 '22 by Has QUIT--Anony-Mousse