Why doesn't scikit-learn's NearestNeighbors seem to return proper cosine similarity distances?

I am trying to use scikit's Nearest Neighbor implementation to find the closest column vectors to a given column vector, out of a matrix of random values.

This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity of those neighbors against column 21.

from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np

test = np.random.randint(0, 5, (50, 50))
nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
distances, indices = nbrs.kneighbors(test)

x = 21

for idx, d in enumerate(indices[x]):
    sim2 = smp.cosine_similarity(test[:, x], test[:, d])

    print "sklearns cosine similarity would be ", sim2
    print 'sklearns reported distance is', distances[x][idx]
    print 'sklearns if that distance was cosine, the similarity would be: ', 1 - distances[x][idx]

The output looks like this:

sklearns cosine similarity would be  [[ 0.66190748]]
sklearns reported distance is 0.616586738214
sklearns if that distance was cosine, the similarity would be:  0.383413261786

So the output of kneighbors is neither the cosine distance nor the cosine similarity. What gives?

Also, as an aside, I thought sklearn's Nearest Neighbors implementation was not an Approximate Nearest Neighbors approach, yet it doesn't seem to detect the actual best neighbors in my dataset, compared to the results I get if I iterate over the matrix and check the similarities of column 21 to all the other ones. Am I misunderstanding something basic here?

asked Apr 12 '14 15:04

pplat

1 Answer

OK, the problem was that NearestNeighbors's .fit() method assumes the rows are samples and the columns are features. I had to transpose the matrix before passing it to fit.
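To make the row/column point concrete, here is a minimal sketch rather than the original code. It assumes Python 3 syntax, the default metric, and a hypothetical non-square matrix (30 rows, 50 columns) chosen only so that the two orientations are easy to tell apart.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: 30 rows, 50 columns (non-square on purpose).
rng = np.random.RandomState(0)
data = rng.randint(0, 5, (30, 50))

# Fitting on `data` treats each of the 30 rows as one sample.
rows_as_samples = NearestNeighbors(n_neighbors=5).fit(data)
_, row_indices = rows_as_samples.kneighbors(data)

# Fitting on `data.T` treats each of the 50 original columns as one sample.
cols_as_samples = NearestNeighbors(n_neighbors=5).fit(data.T)
_, col_indices = cols_as_samples.kneighbors(data.T)

print(row_indices.shape)  # (30, 5): neighbors are row numbers
print(col_indices.shape)  # (50, 5): neighbors are column numbers
```

With the square 50x50 matrix from the question, both orientations have the same shape, which is probably why the mix-up did not raise an error.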

EDIT: Also, another problem is that the callable passed as metric should be a distance callable, not a similarity callable. Otherwise you'll get the K Farthest Neighbors :/
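A minimal sketch of both fixes together, under some assumptions that are not in the original code: Python 3 syntax, the built-in metric='cosine' string in place of a callable, algorithm='brute' for an exact search, and a fixed seed for repeatability. It checks that 1 minus the reported distance matches the cosine similarity, and that the neighbors agree with a brute-force scan over all columns.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)  # fixed seed, an assumption for repeatability
test = rng.randint(0, 5, (50, 50))
samples = test.T  # each original column is now a row, i.e. a sample

# 'cosine' is a distance (1 - cosine similarity), so small means close.
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(samples)
distances, indices = nbrs.kneighbors(samples)

x = 21
# cosine_similarity expects 2-D inputs, hence the reshape to a single row.
query = samples[x].reshape(1, -1)
sims = cosine_similarity(query, samples[indices[x]])[0]

for d, dist, sim in zip(indices[x], distances[x], sims):
    print("column %d: 1 - distance = %.6f, cosine similarity = %.6f" % (d, 1 - dist, sim))

# Brute-force check: the five largest similarities to column x (including x itself).
all_sims = cosine_similarity(query, samples)[0]
best = np.sort(all_sims)[::-1][:5]
```

The first neighbor reported is normally the query column itself, at a distance of about 0. If a callable is really needed, it should return a distance, for example scipy.spatial.distance.cosine, rather than cosine_similarity.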

answered Sep 20 '22 04:09

pplat

Tags:

python-2.7

scikit-learn

cosine-similarity

nearest-neighbor
