
Why doesn't scikit-learn's Nearest Neighbors seem to return proper cosine similarity distances?

I am trying to use scikit's Nearest Neighbor implementation to find the closest column vectors to a given column vector, out of a matrix of random values.

This code is supposed to find the nearest neighbors of column 21 then check the actual cosine similarity of those neighbors against column 21.

from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np

test=np.random.randint(0,5,(50,50))
nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
distances, indices = nbrs.kneighbors(test)

x=21   

for idx,d in enumerate(indices[x]):

    sim2 = smp.cosine_similarity(test[:,x],test[:,d])


    print "sklearns cosine similarity would be ", sim2
    print 'sklearns reported distance is', distances[x][idx]
    print 'sklearns if that distance was cosine, the similarity would be: ' ,1- distances[x][idx]

Output looks like

sklearns cosine similarity would be  [[ 0.66190748]]
sklearns reported distance is 0.616586738214
sklearns if that distance was cosine, the similarity would be:  0.383413261786

So the output of kneighbors is neither the cosine distance nor the cosine similarity. What gives?

Also, as an aside, I thought sklearn's Nearest Neighbors implementation was not an Approximate Nearest Neighbors approach, yet it doesn't seem to detect the actual best neighbors in my dataset, compared to the results I get if I iterate over the matrix and check the similarities of column 21 to all the other ones. Am I misunderstanding something basic here?

pplat asked Apr 12 '14 15:04



1 Answer

OK, the problem was that NearestNeighbors's .fit() method assumes by default that the rows are samples and the columns are features. I had to transpose the matrix before passing it to fit.
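
A minimal sketch of that fix, assuming Python 3 and a reasonably recent scikit-learn. It swaps in the built-in metric='cosine' with brute-force search rather than the callable from the question, and the fixed seed and the name `samples` are only for illustration. Transposing first makes each original column a sample:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)        # fixed seed, only for reproducibility
test = rng.randint(0, 5, (50, 50))

# fit() treats rows as samples, so transpose to make each column a sample.
samples = test.T
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(samples)
distances, indices = nbrs.kneighbors(samples)

x = 21
for d, dist in zip(indices[x], distances[x]):
    # cosine_similarity expects 2D input, hence the reshape to a single row.
    sim = cosine_similarity(test[:, x].reshape(1, -1), test[:, d].reshape(1, -1))[0, 0]
    print("neighbor", d, "similarity", sim, "1 - reported distance", 1 - dist)
```

With the columns passed as samples and a real distance metric, 1 - distance should agree with the directly computed cosine similarity for every reported neighbor.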

EDIT: Also, another problem is that the callable passed as metric should be a distance callable, not a similarity callable. Otherwise you'll get the K Farthest Neighbors :/
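
To illustrate the difference, here is a small sketch (Python 3; the helper names `cosine_distance` and `cosine_sim` are made up for this example) that passes a distance callable and a similarity callable to NearestNeighbors. As far as I understand, a callable metric receives two 1D vectors and must return a single number, and kneighbors returns the samples with the smallest values:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cosine_distance(u, v):
    """Distance callable: 1 - cosine similarity of two 1D vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_sim(u, v):
    """Similarity callable, shown only to illustrate the mistake."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.RandomState(0)                         # fixed seed for reproducibility
samples = rng.randint(0, 5, (50, 50)).T.astype(float)  # columns become samples

good = NearestNeighbors(n_neighbors=5, algorithm='brute', metric=cosine_distance).fit(samples)
bad = NearestNeighbors(n_neighbors=5, algorithm='brute', metric=cosine_sim).fit(samples)

x = 21
_, nearest = good.kneighbors(samples[x:x + 1])
_, farthest = bad.kneighbors(samples[x:x + 1])
print("smallest cosine distance (nearest):", nearest[0])
print("smallest cosine similarity (least similar):", farthest[0])
```

With the distance callable, sample 21 comes back as its own nearest neighbor. With the similarity callable it is not in the result at all, because the search keeps the five smallest similarities.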

pplat answered Sep 20 '22 04:09
