I'm trying to figure out how to implement Principal Coordinate Analysis (PCoA) with various distance metrics. I stumbled across implementations in both skbio and sklearn. I don't understand why sklearn's implementation gives different results every time while skbio's results are always the same. Is there a degree of randomness to Multidimensional Scaling, and in particular to Principal Coordinate Analysis? I see that all of the clusters are very similar, but why are they different? Am I implementing this correctly?
Running Principal Coordinate Analysis using scikit-bio (i.e. skbio) always gives the same results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance
%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)
n, m = DF_data.shape
# print(n, m)
# 150 4

Se_targets = pd.Series(load_iris().target,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index = DF_data.index,
                           columns = DF_data.columns)

# Distance matrix
Ar_dist = distance.squareform(distance.pdist(DF_data, metric="braycurtis")) # (n x n) distance measure
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.index)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
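To check that it really is deterministic, I re-ran the PCoA and compared the coordinates (this assumes a scikit-bio version where pcoa returns an OrdinationResults object with the coordinates in a .samples DataFrame):

# Re-run PCoA on the same distance matrix and compare coordinates
PCoA_rerun = skbio.stats.ordination.pcoa(DM_dist)
print(np.allclose(PCoA.samples.values, PCoA_rerun.samples.values))  # prints True for me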
Now with sklearn's Multidimensional Scaling:
from sklearn.manifold import MDS
fig, ax = plt.subplots(ncols=5, figsize=(12,3))
for rs in range(5):
    M = MDS(n_components=2, metric=True, random_state=rs, dissimilarity='precomputed')
    A = M.fit(Ar_dist).embedding_
    ax[rs].scatter(A[:,0], A[:,1], c=[{0:"b", 1:"g", 2:"r"}[t] for t in Se_targets])
MDS is a probabilistic algorithm; there is a parameter random_state that you can use to fix the random seed, and you can pass it if you want to get the same results each time. PCA, on the other hand, is a deterministic algorithm: if you use sklearn.decomposition.PCA, you should get the same results each time.
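A minimal sketch of that point, reusing Ar_dist and DF_standard from the code in the question:

# With a fixed random_state, two MDS fits give identical embeddings
M1 = MDS(n_components=2, random_state=0, dissimilarity='precomputed').fit(Ar_dist)
M2 = MDS(n_components=2, random_state=0, dissimilarity='precomputed').fit(Ar_dist)
print(np.allclose(M1.embedding_, M2.embedding_))  # True

# PCA is deterministic: repeated fits on the same data always agree
P1 = decomposition.PCA(n_components=2).fit_transform(DF_standard)
P2 = decomposition.PCA(n_components=2).fit_transform(DF_standard)
print(np.allclose(P1, P2))  # True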
Multidimensional scaling (MDS) is a statistical technique that allows researchers to find and explore underlying themes, or dimensions, that explain the similarities or dissimilarities (i.e. distances) between the investigated objects.
scikit-bio's PCoA (skbio.stats.ordination.pcoa) and scikit-learn's MDS (sklearn.manifold.MDS) use entirely different algorithms to transform the data. scikit-bio directly solves a symmetric eigenvalue problem, while scikit-learn uses an iterative minimization procedure [1].
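To make the eigenvalue-problem point concrete, here is a minimal sketch of classical PCoA from scratch. This is the textbook double-centering construction, not scikit-bio's exact implementation, and it reuses Ar_dist from the question:

# Classical PCoA: double-center the squared distance matrix, then eigendecompose
n = Ar_dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (Ar_dist ** 2) @ J     # double-centered Gram matrix
eigvals, eigvecs = np.linalg.eigh(B)  # symmetric eigenvalue problem
order = np.argsort(eigvals)[::-1]     # sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Coordinates: eigenvectors scaled by sqrt of the (non-negative) eigenvalues;
# Bray-Curtis is non-Euclidean, so small negative eigenvalues are clipped to 0
coords = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0))

Because np.linalg.eigh is deterministic for a given input, running this twice gives identical coordinates up to the sign of each eigenvector, which is exactly the kind of arbitrary reflection mentioned below.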
scikit-bio's PCoA is deterministic, though it is possible to receive different (arbitrary) rotations of the transformed coordinates depending on the system it is executed on [2]. scikit-learn's MDS is stochastic by default unless a fixed random_state is used. random_state appears to be used to initialize the iterative minimization procedure (the scikit-learn docs say that random_state is used to "initialize the centers" [3], though I don't know exactly what that means). Each random_state may produce slightly different embeddings with arbitrary rotation [4].
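One way to see that the differences between runs are mostly rotations and reflections is to align two embeddings with a Procrustes analysis. A minimal sketch using scipy.spatial.procrustes and the Ar_dist matrix from the question:

from scipy.spatial import procrustes

# Two embeddings from different random seeds
A0 = MDS(n_components=2, random_state=0, dissimilarity='precomputed').fit(Ar_dist).embedding_
A1 = MDS(n_components=2, random_state=1, dissimilarity='precomputed').fit(Ar_dist).embedding_

# Procrustes removes translation, scaling, rotation, and reflection;
# a small disparity means the embeddings agree up to those transforms
_, _, disparity = procrustes(A0, A1)
print(disparity)

The disparity should be small but typically not exactly zero, since the iterative minimization can also converge to slightly different local minima from different starting points.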
References: [1], [2], [3], [4]