I am fitting a k-nearest neighbors classifier using scikit-learn and noticed that fitting is faster, often by an order of magnitude or more, when using the cosine metric than when using the Euclidean metric. Note that both of these are sklearn built-ins; I am not using a custom implementation of either metric.
What is the reason for such a big discrepancy? I know scikit-learn uses either a ball tree or a KD-tree to compute the neighbor graph, but I'm not sure why the choice of metric would affect the run time of the algorithm.
To quantify the effect, I ran a simulation experiment in which I fit a KNN to random data using either the Euclidean or the cosine metric and recorded the run time in each case. The script below computes and prints the average run time for each setting:
import numpy as np
import time
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

res = []
n_trials = 10
for trial_id in range(n_trials):
    for n_pts in [100, 300, 1000, 3000, 10000, 30000, 100000]:
        for metric in ['cosine', 'euclidean']:
            knn = KNeighborsClassifier(n_neighbors=20, metric=metric)
            X = np.random.randn(n_pts, 100)
            labs = np.random.choice(2, n_pts)
            # Time only the fit step
            starttime = time.time()
            knn.fit(X, labs)
            elapsed = time.time() - starttime
            res.append([elapsed, n_pts, metric, trial_id])

res = pd.DataFrame(res, columns=['time', 'size', 'metric', 'trial'])
# Average fit time over trials, one column per metric
av_times = pd.pivot_table(res, index='size', columns='metric', values='time')
print(av_times)
Edit: These results are from a MacBook with sklearn version 0.21.3. I also reproduced the effect on an Ubuntu desktop machine with sklearn version 0.23.2.
The cosine similarity is advantageous because even if two similar documents are far apart in Euclidean distance because of their size (say, the word 'cricket' appears 50 times in one document and 10 times in another), they can still have a small angle between them. The smaller the angle, the higher the similarity.
However, in such circumstances (i.e., for length-normalized vectors) cosine similarity is in a one-to-one, monotone correspondence with Euclidean distance, so there is no real theoretical advantage of one over the other; in practice, cosine similarity just happens to be faster here.
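A quick numerical sanity check of that correspondence (a minimal sketch; the vectors here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)
v = rng.standard_normal(5)

# Normalize to unit length, as is typical when cosine similarity is the metric of interest
u = u / np.linalg.norm(u)
v = v / np.linalg.norm(v)

cos_sim = u @ v
sq_dist = np.sum((u - v) ** 2)

# For unit vectors, ||u - v||^2 = 2 * (1 - cosine similarity), so neighbor rankings agree
print(np.isclose(sq_dist, 2 * (1 - cos_sim)))  # True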
The Euclidean distance corresponds to the L2 norm of the difference between two vectors. The cosine similarity is proportional to the dot product of the two vectors and inversely proportional to the product of their magnitudes.
Usually, the Euclidean distance is used as the distance metric. The classifier then assigns a query point to the most common class among its k nearest neighbours (where k is a positive integer).
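For concreteness, the two quantities could be computed like this (a rough sketch with made-up word-count vectors, not the internals scikit-learn actually uses):

import numpy as np

def euclidean_distance(u, v):
    # L2 norm of the difference between the vectors
    return np.linalg.norm(u - v)

def cosine_similarity(u, v):
    # Dot product divided by the product of the magnitudes
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([50.0, 3.0])   # e.g. word counts in a long document
v = np.array([10.0, 0.6])   # same proportions in a much shorter document
print(euclidean_distance(u, v))   # large, because the documents differ in size
print(cosine_similarity(u, v))    # close to 1, because the directions are almost identical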
Based on the comments I tried running the code with algorithm='brute' in the KNN, and the Euclidean times sped up to match the cosine times. But trying algorithm='kd_tree' and algorithm='ball_tree' both throw errors, since apparently these algorithms do not accept cosine distance. So it looks like when the classifier is fit in algorithm='auto' mode, it defaults to the brute-force algorithm for the cosine metric, whereas for Euclidean distance it uses one of the tree-based algorithms. Looking at the changelog, the difference between versions 0.23.2 and 0.24.2 presumably comes down to the following item:
neighbors.NeighborsBase benefits of an improved algorithm = 'auto' heuristic. In addition to the previous set of rules, now, when the number of features exceeds 15, brute is selected, assuming the data intrinsic dimensionality is too high for tree-based methods.
So it seems the difference between the two did not have to do with the metric itself, but rather with the performance of a tree-based versus a brute-force search in high dimensions. For sufficiently high dimensions, tree-based searches may fail to outperform linear searches, so the overall runtime ends up slower because of the additional overhead required to construct the data structure. In this case, the implementation was forced to use the faster brute-force search for the cosine metric because the tree-based algorithms do not work with cosine distance, but it (suboptimally) picked a tree-based algorithm in the Euclidean case. This behavior appears to have been noticed and corrected in the latest version; a sketch of how to check which algorithm gets selected is below.
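For anyone who wants to reproduce the check, here is a minimal sketch (note that _fit_method is a private, version-dependent attribute, so treat that part as an assumption rather than a stable API):

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.random.randn(1000, 100)
y = np.random.choice(2, 1000)

# Forcing brute force works for both metrics
for metric in ['cosine', 'euclidean']:
    KNeighborsClassifier(n_neighbors=20, metric=metric, algorithm='brute').fit(X, y)

# The tree-based structures reject the cosine metric
try:
    KNeighborsClassifier(n_neighbors=20, metric='cosine', algorithm='kd_tree').fit(X, y)
except ValueError as err:
    print(err)

# With the default algorithm='auto', the choice actually made can be inspected
# via the private _fit_method attribute (e.g. 'brute' vs 'ball_tree')
for metric in ['cosine', 'euclidean']:
    knn = KNeighborsClassifier(n_neighbors=20, metric=metric).fit(X, y)
    print(metric, knn._fit_method)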
I've run your code snippet on a Mac with sklearn 0.24.1 and got:
metric cosine euclidean
size
100 0.000322 0.000165
300 0.000205 0.000186
1000 0.000273 0.000271
3000 0.000503 0.000531
10000 0.001459 0.001326
30000 0.002919 0.002784
100000 0.008977 0.008872
So it's probably an implementation issue that got fixed in v0.24.