I have a large corpus of text that I have converted to a sparse term-document matrix (stored as a scipy.sparse.csr_matrix). For every document, I want to find its top-n nearest-neighbour matches. I was hoping the NearestNeighbors routine in the Python scikit-learn library (sklearn.neighbors.NearestNeighbors, to be precise) would solve my problem, but the efficient algorithms that use space-partitioning data structures, such as KD-trees or ball trees, do not work with sparse matrices. Only the brute-force algorithm works with sparse input, which is infeasible in my case because the corpus is large.

Is there an efficient implementation of nearest-neighbour search for sparse matrices (in Python or any other language)?

Thanks.
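To make the limitation concrete, here is a minimal sketch (the tiny term-document matrix is made up for illustration): with sparse input, scikit-learn's NearestNeighbors can only run the brute-force search, which compares every pair of documents.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy term-document matrix: 4 documents, 4 terms (counts are arbitrary).
X = csr_matrix(np.array([[1, 0, 2, 0],
                         [0, 1, 0, 3],
                         [1, 0, 2, 1],
                         [0, 2, 0, 3]], dtype=float))

# With sparse input only the brute-force algorithm is usable (tree-based
# choices fall back to it), so every document is compared against every
# other -- O(n^2) in the corpus size.
nn = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine')
nn.fit(X)
dist, idx = nn.kneighbors(X)  # each row's first neighbour is itself
print(idx)
```

On this toy matrix, document 0's nearest non-self neighbour is document 2, since their term vectors point in almost the same direction.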
Late answer: have a look at Locality-Sensitive Hashing (LSH). Support in scikit-learn has been proposed here and here.
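For cosine similarity, the random-hyperplane variant of LSH is easy to sketch by hand (all names and sizes below are illustrative, not from any particular library): each hyperplane contributes one bit of a signature, and documents whose signatures collide land in the same bucket, so a query only scans its own bucket instead of the whole corpus.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(42)
n_docs, n_terms, n_bits = 100, 50, 8

# Toy sparse term-document matrix.
X = sparse_random(n_docs, n_terms, density=0.1, random_state=0, format='csr')

# Random hyperplanes define the hash: the sign of each projection is one bit.
planes = rng.standard_normal((n_terms, n_bits))
bits = (X @ planes) > 0                        # (n_docs, n_bits) booleans
signatures = bits @ (1 << np.arange(n_bits))   # pack bits into an integer key

# Bucket documents by signature; a query scans only its own bucket.
buckets = defaultdict(list)
for doc, sig in enumerate(signatures):
    buckets[int(sig)].append(doc)

candidates = buckets[int(signatures[0])]  # candidate neighbours of document 0
print(len(candidates))
```

In practice you would build several such tables with independent hyperplanes and union the buckets, trading memory for recall; this sketch shows only a single table.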
You can also try transforming your high-dimensional sparse data into low-dimensional dense data using TruncatedSVD, and then building a ball tree.
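That pipeline might look like the following sketch (the matrix shape, density, and component count are arbitrary placeholders): TruncatedSVD accepts sparse input directly and produces a dense low-dimensional embedding, on which a ball tree works fine.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Toy high-dimensional sparse data: 200 documents, 1000 terms.
X = sparse_random(200, 1000, density=0.05, random_state=0, format='csr')

# Reduce to a dense 50-dimensional space (works on sparse input directly).
X_dense = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# Now tree-based search is available.
nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X_dense)
dist, idx = nn.kneighbors(X_dense)
print(idx.shape)  # (200, 5)
```

One caveat: distances in the reduced space only approximate distances in the original space, so the neighbours returned are approximate, with quality depending on how much variance the kept components capture.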