I am fitting a k-nearest neighbors classifier using scikit-learn and noticed that fitting is faster, often by an order of magnitude or more, when using the cosine metric than when using the Euclidean metric. Note that both of these are sklearn built-ins; I am not using a custom implementation of either metric.
What is the reason for such a big discrepancy? I know scikit-learn uses either a ball tree or a KD-tree to compute the neighbor graph, but I'm not sure why the choice of metric would affect the run time of the algorithm.
To quantify the effect, I ran a simulation experiment in which I fit a KNN to random data using either the Euclidean or the cosine metric and recorded the run time in each case. The script below computes and prints the average run time for each setting:
import numpy as np
import time
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

res = []
n_trials = 10
for trial_id in range(n_trials):
    for n_pts in [100, 300, 1000, 3000, 10000, 30000, 100000]:
        for metric in ['cosine', 'euclidean']:
            knn = KNeighborsClassifier(n_neighbors=20, metric=metric)
            X = np.random.randn(n_pts, 100)
            labs = np.random.choice(2, n_pts)
            # Time only the fit step
            starttime = time.time()
            knn.fit(X, labs)
            elapsed = time.time() - starttime
            res.append([elapsed, n_pts, metric, trial_id])

res = pd.DataFrame(res, columns=['time', 'size', 'metric', 'trial'])
# Average fit time over trials, one column per metric
av_times = pd.pivot_table(res, index='size', columns='metric', values='time')
print(av_times)
Edit: These results are from a MacBook with sklearn version 0.21.3. I also reproduced the effect on an Ubuntu desktop machine with sklearn version 0.23.2.
The cosine similarity is advantageous because even if two similar documents are far apart in Euclidean distance because of their size (say, the word 'cricket' appears 50 times in one document and 10 times in another), they can still have a small angle between them. The smaller the angle, the higher the similarity.
However, in such circumstances (i.e., for length-normalized vectors) cosine similarity is in a one-to-one, monotone correspondence with Euclidean distance, so there is no real theoretical advantage of one over the other; in practice, cosine similarity just happens to be faster here.
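A quick numerical sanity check of that correspondence (a minimal sketch; the vectors here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)
v = rng.standard_normal(5)

# Normalize to unit length, as is typical when cosine similarity is the metric of interest
u = u / np.linalg.norm(u)
v = v / np.linalg.norm(v)

cos_sim = u @ v
sq_dist = np.sum((u - v) ** 2)

# For unit vectors, ||u - v||^2 = 2 * (1 - cosine similarity), so neighbor rankings agree
print(np.isclose(sq_dist, 2 * (1 - cos_sim)))  # True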
The Euclidean distance corresponds to the L2 norm of the difference between two vectors. The cosine similarity is proportional to the dot product of the two vectors and inversely proportional to the product of their magnitudes.
Usually, the Euclidean distance is used as the distance metric. The classifier then assigns a query point to the most common class among its k nearest neighbours (where k is a positive integer).
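For concreteness, the two quantities could be computed like this (a rough sketch with made-up word-count vectors, not the internals scikit-learn actually uses):

import numpy as np

def euclidean_distance(u, v):
    # L2 norm of the difference between the vectors
    return np.linalg.norm(u - v)

def cosine_similarity(u, v):
    # Dot product divided by the product of the magnitudes
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([50.0, 3.0])   # e.g. word counts in a long document
v = np.array([10.0, 0.6])   # same proportions in a much shorter document
print(euclidean_distance(u, v))   # large, because the documents differ in size
print(cosine_similarity(u, v))    # close to 1, because the directions are almost identical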
Based on the comments I tried running the code with algorithm='brute' in the KNN, and the Euclidean times sped up to match the cosine times. But trying algorithm='kd_tree' and algorithm='ball_tree' both throw errors, since apparently these algorithms do not accept cosine distance. So it looks like when the classifier is fit in algorithm='auto' mode, it defaults to the brute-force algorithm for the cosine metric, whereas for Euclidean distance it uses one of the tree-based algorithms. Looking at the changelog, the difference between versions 0.23.2 and 0.24.2 presumably comes down to the following item:
neighbors.NeighborsBase benefits of an improved algorithm = 'auto' heuristic. In addition to the previous set of rules, now, when the number of features exceeds 15, brute is selected, assuming the data intrinsic dimensionality is too high for tree-based methods.
So it seems the difference between the two did not have to do with the metric itself, but rather with the performance of a tree-based versus a brute-force search in high dimensions. For sufficiently high dimensions, tree-based searches may fail to outperform linear searches, so the overall runtime ends up slower because of the additional overhead required to construct the data structure. In this case, the implementation was forced to use the faster brute-force search for the cosine metric because the tree-based algorithms do not work with cosine distance, but it (suboptimally) picked a tree-based algorithm in the Euclidean case. This behavior appears to have been noticed and corrected in the latest version; a sketch of how to check which algorithm gets selected is below.
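For anyone who wants to reproduce the check, here is a minimal sketch (note that _fit_method is a private, version-dependent attribute, so treat that part as an assumption rather than a stable API):

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.random.randn(1000, 100)
y = np.random.choice(2, 1000)

# Forcing brute force works for both metrics
for metric in ['cosine', 'euclidean']:
    KNeighborsClassifier(n_neighbors=20, metric=metric, algorithm='brute').fit(X, y)

# The tree-based structures reject the cosine metric
try:
    KNeighborsClassifier(n_neighbors=20, metric='cosine', algorithm='kd_tree').fit(X, y)
except ValueError as err:
    print(err)

# With the default algorithm='auto', the choice actually made can be inspected
# via the private _fit_method attribute (e.g. 'brute' vs 'ball_tree')
for metric in ['cosine', 'euclidean']:
    knn = KNeighborsClassifier(n_neighbors=20, metric=metric).fit(X, y)
    print(metric, knn._fit_method)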
I've run your code snippet on a Mac with sklearn 0.24.1 and got:
metric cosine euclidean
size
100 0.000322 0.000165
300 0.000205 0.000186
1000 0.000273 0.000271
3000 0.000503 0.000531
10000 0.001459 0.001326
30000 0.002919 0.002784
100000 0.008977 0.008872
So it's probably an implementation issue that got fixed in v0.24.