I have a large (100K by 30K) and (very) sparse dataset in svmlight format which I load as follows:
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file
X,Y = load_svmlight_file("somefile_svm.txt")
which returns X as a SciPy sparse matrix.
I simply need to compute the pairwise distances of all training points as
D = pdist(X)
Unfortunately, distance computation implementations in scipy.spatial.distance work only for dense matrices. Due to the size of the dataset it is infeasible to, say, use pdist as
D = pdist(X.todense())
Any pointers to sparse matrix distance computation implementations or workarounds with regards to this problem will be greatly appreciated.
Many thanks
In scikit-learn there is a sklearn.metrics.euclidean_distances function that works both for sparse matrices and dense NumPy arrays. See the reference documentation.
However, non-Euclidean distances are not yet implemented for sparse matrices.
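One caveat for a dataset of this size: the result of euclidean_distances is a dense n-by-n array, so for 100K points the full matrix would need roughly 80 GB in float64 even though the input stays sparse. A minimal sketch of one workaround is to compute the distances one block of rows at a time and reduce each block before moving on. The helper name iter_distance_blocks, the block size, and the small random matrix standing in for X are assumptions made for illustration; they are not part of scikit-learn or of the original question.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import euclidean_distances

# Small random sparse matrix standing in for the X returned by load_svmlight_file.
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)


def iter_distance_blocks(X, block_size=64):
    """Yield (start, stop, block), where block holds the Euclidean distances
    from rows start:stop of X to every row of X. (Hypothetical helper.)"""
    n = X.shape[0]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Each block is a dense (stop - start, n) array; X itself stays sparse.
        yield start, stop, euclidean_distances(X[start:stop], X)


# Example reduction: index of the nearest other point for every row,
# computed without ever holding the full n-by-n matrix in memory.
nearest = np.empty(X.shape[0], dtype=int)
for start, stop, block in iter_distance_blocks(X):
    rows = np.arange(stop - start)
    block[rows, start + rows] = np.inf  # ignore each point's distance to itself
    nearest[start:stop] = block.argmin(axis=1)
```

Whether this helps depends on what the distances are needed for: it only pays off if each block can be reduced (nearest neighbours, thresholding, and so on) rather than stored in full.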