Why is KNN so much faster with cosine distance than Euclidean distance?

I am fitting a k-nearest neighbors classifier using scikit-learn and noticed that fitting is faster, often by an order of magnitude or more, when using the cosine metric than when using the Euclidean metric. Note that both of these are scikit-learn built-ins; I am not using a custom implementation of either metric.

What is the reason behind such a big discrepancy? I know scikit-learn uses either a ball tree or a KD tree to compute the neighbor graph, but I'm not sure why the choice of metric would affect the run time of the algorithm.

To quantify the effect, I ran a simulation experiment in which I fit a KNN classifier to random data using either the Euclidean or the cosine metric and recorded the fit time. The code and the average run times for each case are shown below:

import time

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

res = []
n_trials = 10
for trial_id in range(n_trials):
    for n_pts in [100, 300, 1000, 3000, 10000, 30000, 100000]:
        for metric in ['cosine', 'euclidean']:
            knn = KNeighborsClassifier(n_neighbors=20, metric=metric)
            X = np.random.randn(n_pts, 100)    # random data with 100 features
            labs = np.random.choice(2, n_pts)  # random binary labels
            starttime = time.time()
            knn.fit(X, labs)                   # time only the fit
            elapsed = time.time() - starttime
            res.append([elapsed, n_pts, metric, trial_id])

res = pd.DataFrame(res, columns=['time', 'size', 'metric', 'trial'])
av_times = pd.pivot_table(res, index='size', columns='metric', values='time')
print(av_times)

[Image: the printed pivot table of average fit times, with the euclidean column substantially slower than the cosine column, especially at the larger sizes]

Edit: These results are from a MacBook with sklearn version 0.21.3. I also reproduced the effect on an Ubuntu desktop machine with sklearn version 0.23.2.

asked May 23 '21 by Mike Hawk




2 Answers

Based on the comments, I tried running the code with algorithm='brute' in the KNN, and the Euclidean times sped up to match the cosine times. But trying algorithm='kd_tree' and algorithm='ball_tree' both throw errors, since apparently these algorithms do not accept cosine distance. So it looks like when the classifier is fit in algorithm='auto' mode, it defaults to the brute-force algorithm for the cosine metric, whereas for Euclidean distance it uses one of the tree-based algorithms. Looking at the changelog, the difference between versions 0.23.2 and 0.24.2 presumably comes down to the following item:

neighbors.NeighborsBase benefits of an improved algorithm = 'auto' heuristic. In addition to the previous set of rules, now, when the number of features exceeds 15, brute is selected, assuming the data intrinsic dimensionality is too high for tree-based methods.

So it seems like the difference between the two did not come down to the metric itself, but rather to the performance of a tree-based versus a brute-force search in high dimensions. In sufficiently high dimensions, tree-based searches may fail to outperform a linear scan, so the overall run time is slower due to the added overhead of constructing the tree. In this case, the implementation was forced to use the faster brute-force search for the cosine metric because the tree-based algorithms do not support cosine distance, but it (suboptimally) picked a tree-based algorithm in the Euclidean case. It looks like this behavior was noticed and corrected in later versions.
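For reference, here is a minimal sketch of that check (my own code, not part of the original answer): it forces each search strategy explicitly on the same kind of random data as in the question, times only the fit, and shows that the tree-based structures reject the cosine metric outright.

import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.randn(30000, 100)
y = np.random.choice(2, 30000)

# Force each search strategy explicitly and time only the fit.
for metric, algorithm in [('euclidean', 'kd_tree'),
                          ('euclidean', 'brute'),
                          ('cosine', 'brute')]:
    knn = KNeighborsClassifier(n_neighbors=20, metric=metric, algorithm=algorithm)
    start = time.time()
    knn.fit(X, y)
    print(f"{metric:>9s} + {algorithm:<8s}: {time.time() - start:.4f} s")

# The tree-based structures do not accept cosine distance at all:
try:
    KNeighborsClassifier(n_neighbors=20, metric='cosine', algorithm='kd_tree').fit(X, y)
except ValueError as err:
    print("kd_tree + cosine raises:", err)

On data of this dimensionality you would expect the two brute-force rows to come out comparable, while the kd_tree row carries the extra tree-construction cost described above.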

answered Sep 28 '22 by Mike Hawk


I've run your code snippet on a Mac with sklearn 0.24.1 and got:

metric    cosine  euclidean
size                       
100     0.000322   0.000165
300     0.000205   0.000186
1000    0.000273   0.000271
3000    0.000503   0.000531
10000   0.001459   0.001326
30000   0.002919   0.002784
100000  0.008977   0.008872

So it's probably an implementation issue that got fixed in v0.24.
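If you want to confirm which search strategy algorithm='auto' actually resolves to on your own install, one way is a quick check like the sketch below (my own code; it relies on the private _fit_method attribute, so it may change between versions):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.randn(1000, 100)
y = np.random.choice(2, 1000)

for metric in ['cosine', 'euclidean']:
    knn = KNeighborsClassifier(n_neighbors=20, metric=metric).fit(X, y)
    # _fit_method is a private attribute; on older versions it typically shows
    # 'brute' for cosine and a tree-based method for euclidean, while on 0.24+
    # (with this many features) both show 'brute'.
    print(metric, '->', knn._fit_method)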

answered Sep 28 '22 by igrinis