I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to the pairwise dissimilarity measure used in the k-medoid algorithm, compared with the more familiar sum of squared Euclidean distances used to evaluate variance in k-means. And apparently this different distance measure somehow reduces the influence of noise and outliers.
I have seen this claim but I have yet to see any good reasoning as to the mathematics behind this claim.
What makes the pairwise distance measure commonly used in k-medoid better? More exactly, how does the lack of a squared term allow k-medoids to have the desirable properties associated with the concept of taking a median?
K-means attempts to minimize the total squared error, while k-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars).
The main difference between K-Means and K-Medoids is that K-Means forms clusters based on the distance of observations to each centroid (a computed mean), while K-Medoids forms clusters based on the distance to each medoid (an actual observation chosen as the cluster representative).
K-medoid computes all the pairwise distances, so it runs in O(n^2*k*i), whereas k-means runs in O(n*k*i), where k times the number of iterations i is usually much smaller than n (k*i << n). Hope this answer helps.
K-Medoids is more robust than K-Means: K-Medoids finds k representative objects that minimize the sum of dissimilarities to the data objects, whereas K-Means minimizes the sum of squared Euclidean distances. This unsquared dissimilarity measure reduces the influence of noise and outliers.
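To make the two objectives concrete, here is a minimal sketch of the quantities each algorithm minimizes (the function names and the example dissimilarity are my own, just for illustration):

```python
import numpy as np

def kmeans_cost(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))

def kmedoids_cost(X, labels, medoid_indices, dissimilarity):
    """Sum of (unsquared) dissimilarities from each point to its cluster medoid,
    where each medoid is itself one of the data points."""
    return sum(dissimilarity(X[i], X[medoid_indices[labels[i]]])
               for i in range(len(X)))

# Example dissimilarity: plain Euclidean distance (any measure could be plugged in here).
euclidean = lambda a, b: np.linalg.norm(a - b)

# Tiny example: two 1-D clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
print(kmeans_cost(X, labels, np.array([[0.5], [10.5]])))  # 1.0 (squared errors to centroids)
print(kmedoids_cost(X, labels, [0, 2], euclidean))        # 2.0 (distances to medoid points)
```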
First of all, you can use k-medoids with any dissimilarity measure. K-means, however, may fail to converge; it really must only be used with distances that are consistent with the mean. So, e.g., absolute Pearson correlation must not be used with k-means, but it works well with k-medoids.
Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there is also k-medians, which is like k-means but for Manhattan distance). If you look up literature on the median, you will see plenty of explanations and examples why the median is more robust to outliers than the arithmetic mean. Essentially, these explanations and examples will also hold for the medoid. It is a more robust estimate of a representative point than the mean as used in k-means.
Consider this 1-dimensional example:
[1, 2, 3, 4, 100000]
Both the median and medoid of this set are 3. The mean is 20002.
Which do you think is more representative of the data set? The mean has the lower squared error, but assuming that there might be a measurement error in this data set ...
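A minimal sketch reproducing those numbers (the medoid is computed by brute force as the data point with the smallest summed distance to all the others):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 100000], dtype=float)

mean = x.mean()        # 20002.0 -- dragged toward the outlier
median = np.median(x)  # 3.0

# Medoid: the data point with the smallest sum of absolute distances to all other points.
pairwise = np.abs(x[:, None] - x[None, :])
medoid = x[pairwise.sum(axis=1).argmin()]  # 3.0

print(mean, median, medoid)
```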
Technically, the notion of breakdown point is used in statistics. The median has a breakdown point of 50% (i.e. half of the data points can be incorrect, and the result is still unaffected), whereas the mean has a breakdown point of 0 (i.e. a single large observation can yield a bad estimate).
I do not have a proof, but I assume the medoid will have a similar breakdown point as the median.
That's the main drawback. Usually, PAM takes much longer to run than k-means. As it involves computing all pairwise distances, it is O(n^2*k*i); whereas k-means runs in O(n*k*i), where usually k times the number of iterations is k*i << n.
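To illustrate where that cost comes from, here is a rough sketch of the distance computations involved (using SciPy's cdist; the data and sizes are arbitrary, purely for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
k = 3

# PAM-style k-medoids typically works from the full pairwise matrix: n * n distances.
pam_distances = cdist(X, X)             # shape (1000, 1000)

# k-means only needs distances from every point to the k current centroids: n * k per iteration.
centroids = X[rng.choice(len(X), k, replace=False)]
kmeans_distances = cdist(X, centroids)  # shape (1000, 3)

print(pam_distances.shape, kmeans_distances.shape)
```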
I think this has to do with the selection of the center of the cluster. k-means will select the "center" of the cluster, while k-medoid will select the "most centered" member of the cluster. In a cluster with outliers (i.e. points far away from the other members of the cluster), k-means will place the center of the cluster towards the outliers, whereas k-medoid will select one of the more clustered members (the medoid) as the center.
It now depends on what you use the clustering for. If you just want to classify a bunch of objects, then you don't really care where the center is; but if the clustering is used to train a classifier that will then classify new objects based on those center points, then k-medoid will give you a center closer to where a human would place it.
In Wikipedia's words:
"It [k-medoid] is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances."
Here's an example:
Suppose you want to cluster on one dimension with k=2. One cluster has most of its members around 1000 and the other around -1000; but there is an outlier (or noise) at 100000. It obviously belongs to the cluster around 1000, but k-means will pull the center point away from 1000 and towards 100000. This may even cause some of the members of the 1000 cluster (say a member with value 500) to be assigned to the -1000 cluster. k-medoid will select one of the members around 1000 as the medoid; it'll probably select one that is bigger than 1000, but it will not select an outlier.
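A minimal sketch of that pull, using a handful of made-up values around 1000 plus the outlier (the exact numbers are illustrative only):

```python
import numpy as np

# Members of one cluster around 1000, plus an outlier at 100000.
cluster = np.array([980.0, 990.0, 1000.0, 1010.0, 1020.0, 1030.0, 100000.0])

mean_center = cluster.mean()  # ~15147 -- pulled far away from the bulk of the cluster

# Medoid: the member with the smallest total distance to the other members.
pairwise = np.abs(cluster[:, None] - cluster[None, :])
medoid_center = cluster[pairwise.sum(axis=1).argmin()]  # 1010.0 -- an actual member near 1000

print(mean_center, medoid_center)
```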
Just a tiny note to add to @Eli's answer: k-medoid is more robust to noise and outliers than k-means because the latter selects a cluster center that is usually just a "virtual point", whereas the former chooses an actual object from the cluster.
Suppose you have five 2D points in one cluster, with coordinates (1,1), (1,2), (2,1), (2,2), and (100,100). If we don't consider object exchanges among the clusters, k-means will give you the cluster center (21.2, 21.2), which is pulled heavily toward the point (100,100). K-medoid, however, will choose the center from among (1,1), (1,2), (2,1), and (2,2) according to its algorithm.
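A quick sketch with those five points (plain NumPy; the medoid here is taken as the point with the smallest summed Euclidean distance to the others):

```python
import numpy as np

points = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [100, 100]], dtype=float)

centroid = points.mean(axis=0)  # [21.2, 21.2] -- dragged toward (100, 100)

# Medoid: the actual point with the smallest summed Euclidean distance to all the others.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
medoid = points[dists.sum(axis=1).argmin()]  # [2., 2.] -- a real member of the tight cluster

print(centroid, medoid)
```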
Here is a fun applet (E.M. Mirkes, K-means and K-medoids applet, University of Leicester, 2011) with which you can randomly generate a dataset in the 2D plane and compare the k-medoid and k-means learning processes.