I'm trying out multidimensional scaling with sklearn, pandas and numpy. The data file I'm using has 10 numerical columns and no missing values. I am trying to take this ten-dimensional data and visualize it in 2 dimensions with sklearn.manifold's multidimensional scaling as follows:
import numpy as np
import pandas as pd
from sklearn import manifold
from sklearn.metrics import euclidean_distances
seed = np.random.RandomState(seed=3)
data = pd.read_csv('data/big-file.csv')
# start small, don't take all the data;
# it's about 200k records
subset = data[:10000]
similarities = euclidean_distances(subset)
mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9,
random_state=seed, dissimilarity="precomputed", n_jobs=1)
pos = mds.fit(similarities).embedding_
But I get this value error:
Traceback (most recent call last):
File "demo/mds-demo.py", line 18, in <module>
pos = mds.fit(similarities).embedding_
File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 360, in fit
self.fit_transform(X, init=init)
File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 395, in fit_transform
eps=self.eps, random_state=self.random_state)
File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 242, in smacof
eps=eps, random_state=random_state)
File "/Users/dwilliams/Desktop/Anaconda/lib/python2.7/site-packages/sklearn/manifold/mds.py", line 73, in _smacof_single
raise ValueError("similarities must be symmetric")
ValueError: similarities must be symmetric
I thought euclidean_distances returned a symmetric matrix. What am I doing wrong and how do I fix it?
I ran across the same problem; it turned out that my data was an array of np.float32, and the reduced float precision caused the distance matrix to be asymmetric. I fixed the issue by converting my data to np.float64 before running MDS on it.
Here's an example that uses random data to illustrate the issue:
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances
from sklearn.datasets import make_classification
data, labels = make_classification()
mds = MDS(n_components=2)
similarities = euclidean_distances(data.astype(np.float64))
print(np.abs(similarities - similarities.T).max())
# Prints something like 1.7763568394e-15 (the data is random, so the exact value varies)
mds.fit(data.astype(np.float64))
# Succeeds
similarities = euclidean_distances(data.astype(np.float32))
print(np.abs(similarities - similarities.T).max())
# Prints something like 9.53674e-07 (the data is random, so the exact value varies)
mds.fit(data.astype(np.float32))
# Fails with "ValueError: similarities must be symmetric"
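If converting the input to np.float64 is not an option, one possible workaround is to force the precomputed matrix to be symmetric by averaging it with its transpose before passing it to MDS with dissimilarity="precomputed". The following is only a sketch: the helper name symmetrize is made up for illustration, and the noisy matrix is a stand-in built with numpy alone to imitate the rounding error seen above, not actual output from euclidean_distances.

```python
import numpy as np

def symmetrize(dist):
    """Return an exactly symmetric float64 copy of a square distance matrix.

    Hypothetical helper, not part of sklearn.
    """
    dist = np.asarray(dist, dtype=np.float64)
    # a + b == b + a in floating point, so the result equals its transpose exactly
    sym = (dist + dist.T) / 2.0
    np.fill_diagonal(sym, 0.0)  # the distance from a point to itself is zero
    return sym

rng = np.random.RandomState(0)
points = rng.rand(50, 10)

# Exact pairwise Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
exact = np.sqrt((diff ** 2).sum(axis=-1))

# Assumption: small random noise imitates float32 rounding error
noisy = exact + rng.uniform(-1e-6, 1e-6, size=exact.shape)

max_asymmetry_before = np.abs(noisy - noisy.T).max()
fixed = symmetrize(noisy)
max_asymmetry_after = np.abs(fixed - fixed.T).max()

print(max_asymmetry_before)  # small but nonzero
print(max_asymmetry_after)
```

The averaged matrix differs from the original by no more than the asymmetry itself, so the embedding should not change in any meaningful way.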
I had the same problem a while ago. Another solution, which I believe is much more efficient, is to compute the distances only for the upper triangle of the matrix and then copy them to the lower triangle.
It can be done with scipy as follows:
from scipy.spatial.distance import squareform,pdist
# 'euclidean' matches the euclidean_distances call in the question
similarities = squareform(pdist(data, 'euclidean'))
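To check that this approach avoids the asymmetry even for float32 input, here is a small sketch on random data. The array shape and the 'euclidean' metric are assumptions chosen to mirror the question; pdist returns the condensed upper-triangle distances and squareform mirrors them into a full square matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
# Assumption: 100 random rows with 10 columns stand in for the real data
data32 = rng.rand(100, 10).astype(np.float32)

# Condensed form: one distance per unordered pair, n * (n - 1) / 2 values
condensed = pdist(data32, 'euclidean')

# squareform copies each value into both [i, j] and [j, i]
square = squareform(condensed)

is_symmetric = np.array_equal(square, square.T)
print(condensed.shape, square.shape, is_symmetric)
```

Because each off-diagonal entry is computed once and written to both positions, the result is symmetric by construction and can be passed to MDS with dissimilarity="precomputed".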