I'd like to cluster a set of multidimensional vectors (n > 10) in which each attribute is an angle. What distance measures and algorithms can I use?
I thought of:
- Manhattan distance
- taking max/min of distances between pairs of attributes (http://www.ncbi.nlm.nih.gov/pubmed/9390236)
- summing angular distances between all pairs of attributes
When it comes to distance measures, Euclidean distance seems very natural and intuitive even for objects in multidimensional space. However, I haven't found an equivalent for angles.
And algorithms:
- affinity propagation
- dbscan
- in general, the scikit-learn algorithms, except for k-means (http://scikit-learn.org/stable/modules/clustering.html#clustering)
Here are some examples:
['179.5', '58.8', '78.2', '211.8', '295.6', '194.9', '9.3', '328.3', '40.9', '323.1', '17.2']
['171.4', '74.9', '81.5', '204.4', '284.1', '193.8', '2.1', '326.7', '49.3', '310.4', '30.5']
['64.2', '119.8', '147.2', '213.0', '167.4', '256.4', '349.4', '28.3', '325.6', '29.6', '348.0']
By the way, these numbers are dihedral angles.
For most common clustering software, the default distance measure is the Euclidean distance, i.e. the square root of the sum of the squared differences. Depending on the type of data and the research question, other dissimilarity measures may be preferred. For example, correlation-based distance is often used in gene expression data analysis; under that measure, the distance between two vectors is 0 when they are perfectly correlated.
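As a quick illustration, here is a minimal sketch using SciPy's correlation distance (defined as 1 minus the Pearson correlation); the toy vectors are made up for demonstration:

```python
import numpy as np
from scipy.spatial.distance import correlation

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # perfectly correlated with a

# correlation() returns 1 - Pearson r, so this prints ~0.0
print(correlation(a, b))
```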
While k-means, the simplest and most prominent clustering algorithm, generally uses Euclidean distance as its dissimilarity measure, it is not a stretch to devise variant clustering algorithms that, among other alterations, use different distance measures.
Although Euclidean distance is very common in clustering, it has a drawback: two data vectors with no attribute values in common may end up with a smaller distance than another pair of data vectors that do share attribute values [31,35,36].
Consider mapping each angle to a point on the unit circle. That way two angles such as -π and π, which look far apart numerically, end up at the same point. This means each vector goes from being n-dimensional to 2n-dimensional.
Then I'd try all the normal distance measures.
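A minimal sketch of that embedding in Python, using the example vectors from the question (names like `X_deg` are my own):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Example vectors from the question (dihedral angles in degrees).
X_deg = np.array([
    [179.5,  58.8,  78.2, 211.8, 295.6, 194.9,   9.3, 328.3,  40.9, 323.1,  17.2],
    [171.4,  74.9,  81.5, 204.4, 284.1, 193.8,   2.1, 326.7,  49.3, 310.4,  30.5],
    [ 64.2, 119.8, 147.2, 213.0, 167.4, 256.4, 349.4,  28.3, 325.6,  29.6, 348.0],
])

theta = np.radians(X_deg)
# Each angle becomes a (cos, sin) point on the unit circle: n columns -> 2n.
X_circle = np.hstack([np.cos(theta), np.sin(theta)])

# Ordinary Euclidean distances on the embedding respect the wrap-around.
D = squareform(pdist(X_circle, metric='euclidean'))
print(D.round(2))
```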
If you plan on using k-means, you really must map the data to Euclidean space, i.e. to sin(angle), cos(angle) for each angle. The reason is that otherwise the mean function does not make sense: the mean of the angles -179° and +179° should be -180° (or +180°), but when done naively, the mean would be 0°, which is the opposite!
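Here is a minimal sketch of that with scikit-learn's KMeans (toy one-angle data and arbitrary parameters of my own); the centroid angles are recovered with atan2, i.e. as circular means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two tight groups of angles (degrees) straddling the wrap-around.
X_deg = np.array([[179.0], [181.0], [178.0], [1.0], [358.0], [3.0]])

theta = np.radians(X_deg)
X_circle = np.hstack([np.cos(theta), np.sin(theta)])  # (cos, sin) embedding

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_circle)

# Recover each centroid's angle from its (cos, sin) coordinates via atan2:
# this is the circular mean, so 179° and 181° average to 180°, not 0°.
n = X_deg.shape[1]
centroid_deg = np.degrees(
    np.arctan2(km.cluster_centers_[:, n:], km.cluster_centers_[:, :n])
) % 360
print(km.labels_)             # cluster assignments
print(centroid_deg.round(1))  # one centroid near 180, the other near 0/360
```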
If you give other algorithms a try, such as HAC, PAM, CLARA, DBSCAN, OPTICS, etc., then you can define a custom distance function that handles the 360° wrap-around. For example, you could use min(abs(x-y), 360-abs(x-y)) per attribute and then compute the sum of these, or the sum of squares.
But this approach does not work with k-means!
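For example, here is a sketch of that wrap-around distance plugged into DBSCAN through a precomputed distance matrix; the eps value is an arbitrary illustrative choice, and the data are the question's example vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

def angular_distance(u, v):
    """Sum of per-attribute differences with 360-degree wrap-around."""
    d = np.abs(u - v)
    return np.sum(np.minimum(d, 360.0 - d))

X_deg = np.array([
    [179.5,  58.8,  78.2, 211.8, 295.6, 194.9,   9.3, 328.3,  40.9, 323.1,  17.2],
    [171.4,  74.9,  81.5, 204.4, 284.1, 193.8,   2.1, 326.7,  49.3, 310.4,  30.5],
    [ 64.2, 119.8, 147.2, 213.0, 167.4, 256.4, 349.4,  28.3, 325.6,  29.6, 348.0],
])

# Pairwise distances with the custom metric, then cluster on the matrix.
D = squareform(pdist(X_deg, metric=angular_distance))
labels = DBSCAN(eps=100.0, min_samples=2, metric='precomputed').fit_predict(D)
print(labels)  # the two similar vectors cluster; the third is noise (-1)
```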