 

Why are k-d trees not used for high-dimensional data?

Tags:

algorithm

Quoting Wikipedia on k-d tree's page:

k-d trees are not suitable for efficiently finding the nearest neighbour in high-dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should be N >> 2^k. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search,[11] and approximate nearest-neighbour methods should be used instead.

I don't understand the difference between the dimensionality (k) and the number of points in the data (N), nor why the statement about when k-d trees are unsuitable is true.

justHelloWorld asked May 10 '16


1 Answer

k is the dimensionality of your data, whereas N is the number of points in your data set. So if your data set consists of 10 million points and each point has 3 dimensions, k is 3 and N is 10 million.

The reason that k-d trees are unsuitable for finding nearest neighbours in high dimensions is related to the so-called curse of dimensionality. A k-d tree repeatedly splits the data along a single dimension, but when dealing with high-dimensional data, knowing something about the (Euclidean) distance in one dimension says very little about the distance in the full space, so each split prunes far fewer candidates than it does in low dimensions.
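To see this pruning failure concretely, here is a minimal sketch (my own illustration, not code from the answer): a toy k-d tree whose nearest-neighbour search counts how many nodes it visits. With 2,000 uniform random points, the search prunes most of the tree at k = 3 but degenerates towards checking almost every node at k = 20.

    import random

    def build(points, depth=0):
        # Each node is (point, split axis, left subtree, right subtree).
        if not points:
            return None
        axis = depth % len(points[0])      # cycle through the k dimensions
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return (points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

    def nearest(tree, query, stats):
        # Standard nearest-neighbour search; stats['visited'] counts nodes.
        best = [None, float('inf')]        # best point, best squared distance

        def search(node):
            if node is None:
                return
            stats['visited'] += 1
            point, axis, left, right = node
            d2 = sum((p - q) ** 2 for p, q in zip(point, query))
            if d2 < best[1]:
                best[0], best[1] = point, d2
            # Descend into the half that contains the query first.
            near, far = (left, right) if query[axis] < point[axis] else (right, left)
            search(near)
            # Visit the far half only if the splitting plane is closer than
            # the best match so far; this is the pruning step that stops
            # working in high dimensions.
            if (query[axis] - point[axis]) ** 2 < best[1]:
                search(far)

        search(tree)
        return best[0]

    random.seed(0)
    n = 2000
    for k in (3, 20):
        pts = [tuple(random.random() for _ in range(k)) for _ in range(n)]
        tree = build(pts)
        stats = {'visited': 0}
        nearest(tree, tuple(random.random() for _ in range(k)), stats)
        print(f"k={k:2d}: visited {stats['visited']} of {n} nodes")

On a typical run the k = 3 search touches a few dozen nodes, while the k = 20 search touches nearly all 2,000, which is exactly the "no better than exhaustive search" behaviour the quote describes.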

The reason for wanting a dataset of more than 2^k points is quite intuitive: each split divides the data into two halves of equal size along one dimension. If we have fewer than 2^k data points, we run out of data to split before every dimension has been used once! For example, if you have 4 points in 3 dimensions, we can split on x, giving two sets of two points. We split these on y, giving four sets of one point each. But now we can't split on z anymore!
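To make the 2^k rule of thumb concrete, here is a quick back-of-the-envelope check (illustrative numbers, not part of the original answer):

    # Even modest dimensionality makes 2^k dwarf realistic dataset sizes,
    # so the tree runs out of points to split long before every dimension
    # has been used once.
    for k in (3, 10, 20, 30):
        print(f"k={k:2d}: the rule of thumb wants N >> 2^k = {2**k:,}")

Already at k = 30 the rule asks for well over a billion points, which is why the quote recommends approximate nearest-neighbour methods instead.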

Jordi Vermeulen answered Jan 27 '23