
Sparse implementations of distance computations in Python / scikit-learn

I have a large (100K by 30K) and (very) sparse dataset in svmlight format which I load as follows:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file

X,Y = load_svmlight_file("somefile_svm.txt")

which returns X as a SciPy sparse matrix.

I simply need to compute the pairwise distances of all training points as

D = pdist(X)

Unfortunately, the distance computations in scipy.spatial.distance work only on dense matrices. Given the size of the dataset, it is infeasible to densify it first, as in

D = pdist(X.todense())

Any pointers to distance computation implementations for sparse matrices, or workarounds for this problem, would be greatly appreciated.

Many thanks

asked Jan 21 '12 at 20:01 by Nicholas


1 Answer

In scikit-learn there is a sklearn.metrics.euclidean_distances function that works for both sparse matrices and dense NumPy arrays. See the reference documentation.

However, non-Euclidean distances are not yet implemented for sparse matrices.
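
The Euclidean case might look roughly like the sketch below. The small random matrix only stands in for the svmlight data, and distance_blocks is a hypothetical helper, not part of scikit-learn. Note that the result is a dense n-by-n array, so for 100K points the full float64 matrix would need about 80 GB; computing it one block of rows at a time is one possible way around that.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import euclidean_distances

# Small random sparse matrix standing in for the real data (sizes are made up).
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

# Pairwise Euclidean distances computed directly from the sparse input.
# The result is a dense (n_samples, n_samples) NumPy array.
D = euclidean_distances(X)


def distance_blocks(X, block_size=64):
    """Hypothetical helper: yield (start, block) pairs, where block holds the
    distances from rows start:start+block_size of X to every row of X."""
    for start in range(0, X.shape[0], block_size):
        yield start, euclidean_distances(X[start:start + block_size], X)


# Each block is only (block_size, n_samples), so it can be processed
# (thresholded, reduced, written to disk) and then discarded.
for start, block in distance_blocks(X):
    nearest = block.argsort(axis=1)[:, :5]  # e.g. indices of the 5 closest rows
```

Unlike pdist, which returns a condensed vector, this returns the full square matrix, so it corresponds to squareform(pdist(...)) on the dense data.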

answered Sep 21 '22 at 09:09 by ogrisel