I am looking at the Wikipedia page for k-d trees. As an example, I implemented, in Python, the algorithm it lists for building a k-d tree.
The algorithm for doing KNN search with a k-d tree, however, switches languages and isn't totally clear. The English explanation starts out making sense, but parts of it (such as the part where they "unwind recursion" to check other leaf nodes) don't really make sense to me.
How does this work, and how can one do a KNN search with a KD tree in python? This isn't meant to be a "send me the code!"
type question, and I don't expect that. Just a brief explanation please :)
A k-d tree (also called a k-dimensional tree) is a binary search tree where the data in each node is a k-dimensional point in space. In short, it is a space-partitioning data structure (details below) for organizing points in a k-dimensional space.
The k-d tree is one of the most commonly used nearest-neighbor data structures. The data points are split at each node into two sets. Like any binary tree, each node has at most two children. The split criterion chosen is often the median.
What is the run time of finding the nearest neighbour in a k-d tree? The run time of finding the nearest neighbour in a k-d tree is given as O(2^d log n), where the 2^d term is the time taken to search the neighbourhood around the query point.
k-d trees are a specific data structure for representing our data efficiently. In particular, k-d trees help organize and partition the data points based on specific conditions.
Build a 2d-tree from a labeled 2-D training dataset (points marked red or blue represent two different class labels). For a query point (a new test point with unknown class label), run a k-nearest-neighbor search on the 2d-tree with the query point (for a fixed value of k, e.g., 3).
If we have to construct a k-d tree with 20-dimensional data, we ought to have around 2^20 data points. If we don't have enough data, then at many levels we will not have sufficient points to split, and we will end up with an unbalanced k-d tree where search and other operations are not very efficient.
A k-d tree is a data structure that partitions space by repeatedly choosing a point in the data set to split the current partition in half. It accomplishes this by alternating the dimension along which the split is performed.
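Here is a minimal sketch of that construction in Python. The Node layout and function names are just illustrative, not from any particular library:

```python
from collections import namedtuple

# Illustrative node layout: the splitting point, the axis it splits on,
# and the two subtrees.
Node = namedtuple("Node", "point axis left right")

def build_kdtree(points, depth=0):
    """Recursively split on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])        # alternate the split dimension
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2               # median along the current axis
    return Node(point=points[mid],
                axis=axis,
                left=build_kdtree(points[:mid], depth + 1),
                right=build_kdtree(points[mid + 1:], depth + 1))

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```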
k-nearest neighbors search: this method returns the k points that are closest to the query point (in any order), or all n points in the data structure if n ≤ k. It must do this efficiently, i.e. using the pruning technique from k-d tree nearest-neighbor search, not brute force.
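For k > 1, a common technique (not spelled out in the snippet above) is to carry a bounded max-heap of the k best candidates found so far, and prune the "away" subtree only when it cannot beat the current k-th best. A sketch, reusing the illustrative Node layout from above:

```python
import heapq

def distance_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(node, point, k, heap=None):
    """Collect the k nearest points; heap holds (-dist_sq, point) pairs
    so the *farthest* of the current candidates sits at heap[0]."""
    if heap is None:
        heap = []
    if node is None:
        return heap

    d = distance_sq(point, node.point)
    if len(heap) < k:
        heapq.heappush(heap, (-d, node.point))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node.point))

    # Descend into the "near" child first.
    diff = point[node.axis] - node.point[node.axis]
    near, away = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    knn(near, point, k, heap)

    # Visit the "away" child only if we still need candidates, or the
    # splitting line is closer than the current k-th best match.
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        knn(away, point, k, heap)
    return heap

# tree from the build sketch above; nearest first after sorting.
nearest = [p for _, p in sorted(knn(tree, (6, 3), k=3), reverse=True)]
```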
From the introduction of this book, page 3:
Given a set P of n points in a d-dimensional space, the k-d tree is constructed recursively as follows. First, one finds a median of the values of the i-th coordinates of the points (initially, i = 1). That is, a value M is computed so that at least 50% of the points have their i-th coordinate greater than or equal to M, while at least 50% of the points have their i-th coordinate less than or equal to M. The value of M is stored, and the set P is partitioned into PL and PR, where PL contains only the points with their i-th coordinate less than or equal to M, and |PR| = |PL| ± 1. The process is then repeated recursively on both PL and PR, with i replaced by i + 1 (or 1, if i = d). When the set of points at a node has size 1, the recursion stops.
The following paragraphs discuss its use in solving nearest neighbor.
Or, here is the original 1975 paper by Jon Bentley.
EDIT: I should add that SciPy has a k-d tree implementation, scipy.spatial.KDTree:
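A minimal usage example (KDTree and its query method are the actual SciPy API; the data here is made up):

```python
import numpy as np
from scipy.spatial import KDTree

points = np.random.rand(100, 2)            # 100 random 2-D points
tree = KDTree(points)

dists, idxs = tree.query([0.5, 0.5], k=3)  # 3 nearest neighbours
print(points[idxs], dists)
```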
I've just spent some time puzzling out the Wikipedia description of the algorithm myself, and came up with the following Python implementation that may help: https://gist.github.com/863301
The first phase of closest_point is a simple depth-first search to find the best matching leaf node. Instead of simply returning the best node found back up the call stack, a second phase checks whether there could be a closer node on the "away" side:
```
              n          current node
  b           |          best match so far
  |      p    |          point we're looking for
  |<---->|    |          error (distance from p to b)
         |<-->|          distance to the "away" side
  |<----------|>|        error "sphere" extends to the "away" side
              | x        possible better match on the "away" side
```
The current node n splits the space along a line, so we only need to look on the "away" side if the "error" between the point p and the best match b is greater than the distance from the point p to the line through n. If it is, then we check to see if there are any points on the "away" side that are closer.
Because our best matching node is passed into this second test, it doesn't have to do a full traversal of the branch, and it will stop pretty quickly if it's on the wrong track (it only heads down the "near" child nodes until it hits a leaf).
To compute the distance between the point p and the line splitting the space through the node n, we can simply "project" the point down onto the axis by copying the appropriate coordinate, as the axes are all orthogonal (horizontal or vertical).
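Condensed into code, the two phases look roughly like this. This is only a sketch of the same idea (the gist above has the full version); it assumes the illustrative Node layout and distance_sq helper from the sketches earlier in this thread:

```python
def closest_point(node, point, best=None):
    if node is None:
        return best

    # Phase 1: depth-first descent toward the leaf nearest to `point`,
    # updating `best` along the way.
    if best is None or distance_sq(point, node.point) < distance_sq(point, best):
        best = node.point

    diff = point[node.axis] - node.point[node.axis]
    near, away = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    best = closest_point(near, point, best)

    # Phase 2 (the "unwinding"): the error sphere around `point` reaches
    # across the splitting line only if the current error exceeds |diff|,
    # the projected distance from `point` to the line. Only then can the
    # "away" side contain a closer point.
    if diff ** 2 < distance_sq(point, best):
        best = closest_point(away, point, best)

    return best
```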