
Multidimensional Scaling Fitting in Numpy, Pandas and Sklearn (ValueError)

I'm trying out multidimensional scaling with sklearn, pandas and numpy. The data file I'm using has 10 numerical columns and no missing values. I want to take this ten-dimensional data and visualize it in 2 dimensions with sklearn.manifold's multidimensional scaling, as follows:

import numpy as np
import pandas as pd
from sklearn import manifold
from sklearn.metrics import euclidean_distances

seed = np.random.RandomState(seed=3)
data = pd.read_csv('data/big-file.csv')

# start small: don't take all the data,
# it's about 200k records
subset = data[:10000]
similarities = euclidean_distances(subset)

mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, 
      random_state=seed, dissimilarity="precomputed", n_jobs=1)

pos = mds.fit(similarities).embedding_

But I get this value error:

Traceback (most recent call last):
  File "demo/mds-demo.py", line 18, in <module>
    pos = mds.fit(similarities).embedding_
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 360, in fit
    self.fit_transform(X, init=init)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 395, in fit_transform
    eps=self.eps, random_state=self.random_state)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 242, in smacof
    eps=eps, random_state=random_state)
  File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 73, in _smacof_single
    raise ValueError("similarities must be symmetric")
ValueError: similarities must be symmetric

I thought euclidean_distances returned a symmetric matrix. What am I doing wrong and how do I fix it?

David Williams asked Jun 07 '13 18:06



2 Answers

I ran across the same problem; it turned out that my data was an array of np.float32 and the reduced float precision caused the distance matrix to be asymmetric. I fixed the issue by converting my data to np.float64 before running MDS on it.

Here's an example that uses random data to illustrate the issue:

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances
from sklearn.datasets import make_classification

data, labels = make_classification()
mds = MDS(n_components=2)

# float64: the asymmetry is at the level of double-precision round-off
similarities = euclidean_distances(data.astype(np.float64))
print(np.abs(similarities - similarities.T).max())
# Prints a tiny value, e.g. 1.7763568394e-15
mds.fit(data.astype(np.float64))
# Succeeds

# float32: the asymmetry is many orders of magnitude larger
similarities = euclidean_distances(data.astype(np.float32))
print(np.abs(similarities - similarities.T).max())
# Prints a much larger value, e.g. 9.53674e-07
mds.fit(data.astype(np.float32))
# Fails with "ValueError: similarities must be symmetric"
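
If the data can't easily be converted up front, a related workaround is to force the distance matrix itself to be symmetric before handing it to MDS. This is only a sketch under assumptions, not something taken from sklearn or from the example above: the helper name symmetrize and the random stand-in data are made up for illustration, and the distance matrix is built with the same dot-product expansion that euclidean_distances documents, so it picks up similar round-off.

```python
import numpy as np


def symmetrize(distances):
    """Return an exactly symmetric float64 copy with a zero diagonal.

    Hypothetical helper for illustration; not part of sklearn.
    """
    d = np.asarray(distances, dtype=np.float64)
    sym = (d + d.T) / 2.0        # average each pair of mirrored entries
    np.fill_diagonal(sym, 0.0)   # a point is at distance 0 from itself
    return sym


# Stand-in data (assumption): 50 random float32 points in 10 dimensions.
rng = np.random.RandomState(0)
points = rng.rand(50, 10).astype(np.float32)

# Distances via ||x||^2 + ||y||^2 - 2 x.y, which may carry small round-off
# differences between entry (i, j) and entry (j, i).
sq_norms = (points ** 2).sum(axis=1)
squared = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points.dot(points.T)
noisy = np.sqrt(np.maximum(squared, 0.0))

clean = symmetrize(noisy)
print(np.array_equal(clean, clean.T))  # True: mirrored entries are identical
```

The result can then be passed to an MDS instance created with dissimilarity="precomputed", as in the question. Averaging only hides the round-off; it doesn't recover the precision that float32 lost.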
Josh Rosen answered Nov 07 '22 03:11



I had the same problem a while ago. Another solution, which I believe is much more efficient, is to compute the distances only for the upper triangle of the matrix and then copy them to the lower triangle, which makes the result symmetric by construction.

It can be done with scipy as follows:

from scipy.spatial.distance import squareform, pdist
# 'euclidean' matches the metric used by euclidean_distances in the question
similarities = squareform(pdist(data, 'euclidean'))
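
To see why this sidesteps the error, here is a small check on random stand-in data (an assumption, in place of the real file), using the plain 'euclidean' metric so the result is comparable to euclidean_distances in the question. pdist returns each pairwise distance once, in condensed form, and squareform mirrors it into both triangles, so entry (i, j) and entry (j, i) are the same number.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stand-in data (assumption): 100 random float32 points in 10 dimensions.
rng = np.random.RandomState(0)
data = rng.rand(100, 10).astype(np.float32)

# Condensed form: one value per unordered pair, n * (n - 1) / 2 in total.
condensed = pdist(data, 'euclidean')

# Full n x n matrix with the same values mirrored across a zero diagonal.
similarities = squareform(condensed)

print(condensed.shape)     # (4950,)
print(similarities.shape)  # (100, 100)
print(np.array_equal(similarities, similarities.T))  # True
```

The full matrix is still n x n, so for the 10000-row subset in the question it holds 10^8 float64 values (roughly 800 MB) whichever way it is computed.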
memecs answered Nov 07 '22 03:11
