Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scipy Sparse - distance matrix (Scikit or Scipy)

I am trying to compute nearest neighbour clustering on a Scipy sparse matrix returned from scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn I get an error message using 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could calculate these distance matrices.

My matrix is highly sparse with a shape of: <364402x223209 sparse matrix of type <class 'numpy.float64'> with 728804 stored elements in Compressed Sparse Row format>.

I have also tried methods such as pdist and kdtree in Scipy but have received other errors of not being able to process the result.

Can anyone please point me to a solution that would effectively allow me calculate the distance matrix and/or the nearest neighbour result?

Some example code:

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial

file = 'FileLocation'
data = []
FILE = open(file,'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user':str(int(templine[0])),str(int(templine[1])):int(templine[2])})
FILE.close()

vec = DictVectorizer()
X = vec.fit_transform(data)

result = scipy.spatial.KDTree(X)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack

Similarly, if I run:

scipy.spatial.distance.pdist(X,'euclidean')

I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

Finally, running NearestNeighbor in scikit-learn results in a memory error using:

nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
like image 943
user2694306 Avatar asked Jan 13 '14 07:01

user2694306


2 Answers

First, you can't use KDTree and pdist with sparse matrix, you have to convert it to dense (your choice whether it's your option):

>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>

>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])

Second, from the docs:

Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.

You might want to try 'ball_tree' algorithm and see if it can handle your data.

like image 110
alko Avatar answered Nov 18 '22 21:11

alko


From your comment:

Since it is a sparse matrix, I would expect there to be solutions to intelligently calculate the distances and store the result in a similarly sparse matrix.

Basic math shows that this is only possible in the case that your input matrix contains a massive number of duplicates, because Euclidean distance is only zero for two exactly equal points (this is actually one of the axioms of distance). So if you remove duplicates this might work.

Otherwise, depending on your problem, you might be able to use sklearn.metrics.pairwise_distances_argmin_min or cosine similarity, X * X.T, which has the reverse ordering compared to Euclidean distance.

like image 37
Fred Foo Avatar answered Nov 18 '22 19:11

Fred Foo