Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Incremental Nearest Neighbor Algorithm in Python [closed]

Is anyone aware of a nearest neighbor algorithm implemented in Python that can be updated incrementally? All the ones I've found, such as this one, appear to be batch processes. Is it possible to implement an incremental NN algorithm?

like image 251
Cerin Avatar asked Nov 25 '10 06:11

Cerin


People also ask

What is drawback of the neighbors algorithm?

It's main disadvantages are that it is quite computationally inefficient and its difficult to pick the “correct” value of K. However, the advantages of this algorithm is that it is versatile to different calculations of proximity, it's very intuitive and that it's a memory based approach.

How does Python implement KNN from scratch?

Building out the KNN Framework Use the distance function to get the distance between a test point and all known data points. Sort distance measurements to find the points closest to the test point (i.e., find the nearest neighbors) Use majority class labels of those closest points to predict the label of the test point.

Is KNN supervised or unsupervised?

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It's easy to implement and understand, but has a major drawback of becoming significantly slows as the size of that data in use grows.


2 Answers

This is way late, but for posterity:

There is actually a technique for converting batch-processed algorithms like KD-Tree into incremental algorithms: it's called a static-to-dynamic transformation.

To generate an incremental variant of a KD-Tree, you store a set of trees instead of just one tree. When there are N elements in your nearest-neighbor structure, your structure will have a tree for each "1" bit in the binary representation of N. Moreover, if tree T_i corresponds to the i-th bit of N, then tree T_i contains 2^i elements.

So, if you have 11 elements in your structure, then N = 11, or 1011 in binary, and therefore you have three trees - T_3, T_1, and T_0 - with 8 elements, 2 elements, and 1 element, respectively.

Now, let's insert an element e into our structure. After insertion, we'll have 12 elements, or 1100 in binary. Comparing the new and the previous binary string, we see that T_3 doesn't change, we have a new tree T_2 with 4 elements, and trees T_1 and T_0 get deleted. We construct the new tree T_2 by doing a batch insertion of e along with all the elements in the trees "below" T_2, which are T_1 and T_0.

In this way, we create an incremental point query structure from a static base structure. There is, however, an asymptotic slowdown in "incrementalizing" static structures like this in the form of an extra log(N) factor:

  • inserting N elements in structure: O(N log(N) log(n))
  • nearest neighbor query for structure with N elements: O(log(n) log(n))
like image 178
giogadi Avatar answered Oct 13 '22 00:10

giogadi


I think the problem with incremental construction of a KD-tree or KNN-tree is, as you've alluded to in a comment, that the tree will eventually become unbalanced and you can't do simple tree rotation to fix balance problems and keep consistency. At the minimum, the re-balancing task is not trivial and one would definitely not want to do it at each insertion. Often, one will choose to build a tree with a batch method, insert a bunch of new points and allow the tree to become unbalanced up to a point, and then re-balance it.

A very similar thing to do is to build the data structure in batch for M points, use it for M' points, and then re-build the data structure in batch with M+M' points. Since re-balancing is not normal, fast algorithm we are familiar with for trees, rebuilding is not necessarily slow in comparison and in some cases can be faster (depending on how the sequence of the points entering your incremental algorithm).

That being said, the amount of code you write, debugging difficulty, and the ease of others' understanding of your code can be significantly smaller if you take the rebuild approach. If you do so, you can use a batch method and keep an external list of points not yet inserted into the tree. A brute force approach can be used to ensure none of these is closer than the ones in the tree.

Some links to Python implementations/discussions are below, but I haven't found any that explicitly claim to be incremental. Good luck.

http://www.scipy.org/Cookbook/KDTree

http://cgi.di.uoa.gr/~compgeom/pycgalvisual/kdppython.shtml

http://sites.google.com/site/mikescoderama/Home/kd-tree-knn

http://en.wikipedia.org/wiki/Kd-tree

Note: My comments here apply to high-dimensional spaces. If you're working in 2D or 3D, what I've said may not be appropriate. (If you working in very high dimensional spaces, use brute force or approximate nearest neighbor.)

like image 22
RandomGuy Avatar answered Oct 12 '22 22:10

RandomGuy