There are several nearest neighbor R packages (e.g., FNN, RANN, yaImpute) but none of them seem to allow saving off the NN data structure (cover tree, KD tree etc.) so that the nearest neighbors of new queries can be calculated without reconstructing the whole tree. Are there any such functions in R?
I am looking for a function that returns a data structure that I can update incrementally as new data arrives to perform approximate K nearest neighbor search.
A major problem with the simple nearest-neighbor algorithm is that it considers the entire set of n points for every execution. However, consider the Ann and Aknn problems where the same dataset is used n times.
In approximately nearest neighbors (ANN), we build index structures that narrow down the search space. The implicit neighborhoods in such indexes also help reduce the problem of high dimensions. You can roughly divide the approaches used for ANNs into whether or not they can be implemented using an inverse index.
There is a good reason why no NN package does that.
The reason is that the "NN data structure" necessarily includes all the input data points (in the form of a KD tree), so there is no space savings against the input data. It appears that there would be time savings in not having to re-create the KD-tree for each new input, but this is not the case, alas.
The reason is that the time to build a KD-tree is, in general, worse than linearithmic. This means that, for large inputs, it makes sense to sort the data before building the KD-tree because that will produce the KD-tree faster and it will be better balanced, which will improve the search too (it is also worse than logarithmic, in general). This approach would speed up modeling and evaluation but discourage incremental updates, of course.
Your best bet, I think, if to find a generic KD-tree package and use it instead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With