I'm trying to figure out how to implement Principal Coordinate Analysis (PCoA) with various distance metrics. I stumbled across implementations in both skbio and sklearn. I don't understand why sklearn's implementation gives different results every time while skbio's results are always the same. Is there a degree of randomness to Multidimensional Scaling, and in particular to Principal Coordinate Analysis? I see that all of the clusters are very similar, but why are they different? Am I implementing this correctly?
Running Principal Coordinate Analysis using scikit-bio (i.e. skbio) always gives the same results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance
%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)
n, m = DF_data.shape
# print(n, m)
# 150 4

Se_targets = pd.Series(load_iris().target,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index = DF_data.index,
                           columns = DF_data.columns)

# Distance matrix
Ar_dist = distance.squareform(distance.pdist(DF_data, metric="braycurtis")) # (n x n) distance measure
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.index)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
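To check that it really is deterministic, I re-ran the PCoA and compared the coordinates (this assumes a scikit-bio version where pcoa returns an OrdinationResults object with the coordinates in a .samples DataFrame):

# Re-run PCoA on the same distance matrix and compare coordinates
PCoA_rerun = skbio.stats.ordination.pcoa(DM_dist)
print(np.allclose(PCoA.samples.values, PCoA_rerun.samples.values))  # prints True for me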
Now with sklearn's Multidimensional Scaling:
from sklearn.manifold import MDS
fig, ax = plt.subplots(ncols=5, figsize=(12,3))
for rs in range(5):
    M = MDS(n_components=2, metric=True, random_state=rs, dissimilarity='precomputed')
    A = M.fit(Ar_dist).embedding_
    ax[rs].scatter(A[:,0], A[:,1], c=[{0:"b", 1:"g", 2:"r"}[t] for t in Se_targets])
MDS is a probabilistic algorithm; there is a parameter random_state that you can use to fix the random seed, and you can pass it if you want to get the same results each time. PCA, on the other hand, is a deterministic algorithm: if you use sklearn.decomposition.PCA, you should get the same results each time.
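A minimal sketch of that point, reusing Ar_dist and DF_standard from the code in the question:

# With a fixed random_state, two MDS fits give identical embeddings
M1 = MDS(n_components=2, random_state=0, dissimilarity='precomputed').fit(Ar_dist)
M2 = MDS(n_components=2, random_state=0, dissimilarity='precomputed').fit(Ar_dist)
print(np.allclose(M1.embedding_, M2.embedding_))  # True

# PCA is deterministic: repeated fits on the same data always agree
P1 = decomposition.PCA(n_components=2).fit_transform(DF_standard)
P2 = decomposition.PCA(n_components=2).fit_transform(DF_standard)
print(np.allclose(P1, P2))  # True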
Multidimensional scaling (MDS) is a statistical technique that allows researchers to find and explore underlying themes, or dimensions, that explain the similarities or dissimilarities (i.e. distances) between the investigated objects.
scikit-bio's PCoA (skbio.stats.ordination.pcoa) and scikit-learn's MDS (sklearn.manifold.MDS) use entirely different algorithms to transform the data. scikit-bio directly solves a symmetric eigenvalue problem, while scikit-learn uses an iterative minimization procedure [1].
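To make the eigenvalue-problem point concrete, here is a minimal sketch of classical PCoA from scratch. This is the textbook double-centering construction, not scikit-bio's exact implementation, and it reuses Ar_dist from the question:

# Classical PCoA: double-center the squared distance matrix, then eigendecompose
n = Ar_dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (Ar_dist ** 2) @ J     # double-centered Gram matrix
eigvals, eigvecs = np.linalg.eigh(B)  # symmetric eigenvalue problem
order = np.argsort(eigvals)[::-1]     # sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Coordinates: eigenvectors scaled by sqrt of the (non-negative) eigenvalues;
# Bray-Curtis is non-Euclidean, so small negative eigenvalues are clipped to 0
coords = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0))

Because np.linalg.eigh is deterministic for a given input, running this twice gives identical coordinates up to the sign of each eigenvector, which is exactly the kind of arbitrary reflection mentioned below.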
scikit-bio's PCoA is deterministic, though it is possible to receive different (arbitrary) rotations of the transformed coordinates depending on the system it is executed on [2]. scikit-learn's MDS is stochastic by default unless a fixed random_state is used. random_state appears to be used to initialize the iterative minimization procedure (the scikit-learn docs say that random_state is used to "initialize the centers" [3], though I don't know exactly what that means). Each random_state may produce slightly different embeddings with arbitrary rotation [4].
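One way to see that the differences between runs are mostly rotations and reflections is to align two embeddings with a Procrustes analysis. A minimal sketch using scipy.spatial.procrustes and the Ar_dist matrix from the question:

from scipy.spatial import procrustes

# Two embeddings from different random seeds
A0 = MDS(n_components=2, random_state=0, dissimilarity='precomputed').fit(Ar_dist).embedding_
A1 = MDS(n_components=2, random_state=1, dissimilarity='precomputed').fit(Ar_dist).embedding_

# Procrustes removes translation, scaling, rotation, and reflection;
# a small disparity means the embeddings agree up to those transforms
_, _, disparity = procrustes(A0, A1)
print(disparity)

The disparity should be small but typically not exactly zero, since the iterative minimization can also converge to slightly different local minima from different starting points.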
References: [1], [2], [3], [4]