Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tuning leaf_size to decrease time consumption in Scikit-Learn KNN

I was trying to implement KNN for handwritten character recognition where I found out that the execution of code was taking a lot of time. When added parameter leaf_size with value 400, I observed that time taken by code to execute was significantly reduced.

Original Code:

knn = KNeighborsClassifier(n_neighbors=3)

New Code:

knn = KNeighborsClassifier(n_neighbors=3,leaf_size=400)

I have read few documentation and articles regarding the leaf_size parameter of the KDtree/Balltree but couldn't find any good enough reference on how to safely tune this parameter without any accuracy and information loss.

It would be very kind if someone could share some insights regarding the above issue.

Related Articles I referred to:
1.) http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
2.) https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/
3.) http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

like image 650
Harshit Saini Avatar asked Apr 21 '18 08:04

Harshit Saini


1 Answers

couldn't find any good enough reference on how to safely tune this parameter without any accuracy and information loss.

In general, the larger leaf_size, the closer neighbors the algorithm picks (note this is not the same as higher accuracy). This is because the tree's main purpose is to reduce the number of candidates for the neighbours. Neither KD-tree, nor Ball Tree guarantees that the split will keep true nearest points close in a tree. In this sense, the use of a tree implies "information loss", and the larger the tree, the higher the chance to put the true neighbors into a wrong branch. In extreme case, there's simply no tree at all (called brute force), so all points are the candidates for the neighbours, as a result the algorithm is guaranteed to find the exact solution.

However, it's possible, at least in theory, that even the wrong choice of the neighbors actually results in higher classification accuracy. But it's almost impossible to say in advance how the change of leaf_size affects the final accuracy.

When added parameter leaf_size with value 400, I observed that time taken by code to execute was significantly reduced.

This is interesting, because leaf_size increase (default is 30) usually results in query time reduction. But it is possible that most of that time was spent on construction, in this case the larger the size of a leaf, the faster this phase is completed. Remember, that the Ball Tree doesn't guarantee that the result tree is balanced. When it's highly unbalanced the construction can even take O(N^2) time. In this case the leaf_size increase makes perfect sense. This post contains very interesting experiments on this matter.

like image 68
Maxim Avatar answered Nov 16 '22 10:11

Maxim