
Efficient nearest neighbour search for sparse matrices

I have a large corpus of text that I have converted to a sparse term-document matrix (stored as a scipy.sparse.csr_matrix). For every document, I want to find its top n nearest-neighbour matches. I was hoping the nearest-neighbour routine in scikit-learn (sklearn.neighbors.NearestNeighbors, to be precise) would solve my problem, but the efficient algorithms that use space-partitioning data structures, such as KD-trees or ball trees, do not work with sparse matrices. Only the brute-force algorithm accepts sparse input, which is infeasible in my case because the corpus is large.
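To illustrate the limitation: here is a minimal sketch (with a small made-up term-document matrix) showing that algorithm="brute" accepts sparse input, while the tree-based back ends require dense arrays.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy term-document matrix: 5 documents x 4 terms (assumed data).
X = csr_matrix(np.array([[1, 0, 1, 0],
                         [1, 0, 1, 1],
                         [0, 1, 0, 1],
                         [0, 1, 1, 0],
                         [1, 1, 0, 0]], dtype=float))

# Brute force works on sparse matrices; 'kd_tree'/'ball_tree' do not.
nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine")
nn.fit(X)
dist, idx = nn.kneighbors(X)

# Each document's nearest neighbour is itself (cosine distance 0),
# followed by its closest other document.
print(idx)
```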

Is there an efficient implementation of nearest-neighbour search for sparse matrices (in Python or any other language)?

Thanks.

Asked Aug 10 '13 by abhinavkulkarni

People also ask

How do I choose the number of neighbors in KNN?

In KNN, K is the number of nearest neighbors, and it is the core deciding factor of the algorithm. K is generally chosen to be odd when there are 2 classes, so that majority votes cannot tie. When K=1, the algorithm is known as the nearest-neighbor algorithm.

Which method is used to find the nearest neighbors distance in Sklearn?

NearestNeighbors implements unsupervised nearest-neighbors learning. It acts as a uniform interface to three different nearest-neighbor algorithms: BallTree, KDTree, and a brute-force algorithm based on pairwise-distance routines in sklearn.
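As a quick sketch of that uniform interface (on small made-up dense data, since the tree back ends require dense input), all three back ends can be selected via the algorithm parameter and return the same neighbours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(42)
X = rng.rand(100, 3)  # assumed dense data; trees cannot take sparse input

results = {}
for algo in ("ball_tree", "kd_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    _, idx = nn.kneighbors(X)
    results[algo] = idx

# With continuous random data (no distance ties), all three back ends
# agree on the neighbour indices.
print((results["ball_tree"] == results["brute"]).all())
```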

What is nearest neighbor search explain with example?

All nearest neighbors. As a simple example: when we find the distance from point X to point Y, that also tells us the distance from point Y to point X, so the same calculation can be reused in two different queries.


2 Answers

Late answer: have a look at locality-sensitive hashing (LSH).

Support in scikit-learn has been proposed here and here.
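To make the idea concrete, here is a minimal from-scratch sketch of one LSH family, signed random projections, which approximates cosine similarity and works directly on sparse matrices. The function name and parameters are illustrative, not from any library:

```python
import numpy as np
from scipy.sparse import csr_matrix

def lsh_buckets(X, n_bits=16, seed=0):
    """Hash each row of a (sparse) matrix into an integer bucket key
    using random hyperplanes (signed random projections)."""
    rng = np.random.RandomState(seed)
    planes = rng.randn(X.shape[1], n_bits)   # random hyperplane normals
    signs = (X @ planes) > 0                 # sparse @ dense -> dense
    # Pack each row's sign pattern into a single integer key.
    powers = 1 << np.arange(n_bits, dtype=np.uint64)
    return signs.astype(np.uint64) @ powers

# Identical documents always land in the same bucket; similar ones
# (small cosine angle) land in the same bucket with high probability.
X = csr_matrix(np.array([[1, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]], dtype=float))
keys = lsh_buckets(X)
```

Only documents sharing a bucket are treated as candidate neighbours, so exact distances need to be computed for a small candidate set instead of the whole corpus.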

Answered Nov 13 '22 by Unapiedra


You can try transforming your high-dimensional sparse data into low-dimensional dense data using TruncatedSVD, and then build a ball tree on the result.
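A minimal sketch of that pipeline, using a random sparse matrix as a stand-in for the term-document matrix (the shapes and parameters are illustrative):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Stand-in for the corpus: 200 documents x 1000 terms, 1% non-zero.
X = sparse_random(200, 1000, density=0.01, format="csr", random_state=0)

# Reduce to a dense 50-dimensional representation (this is latent
# semantic analysis when applied to term-document matrices).
svd = TruncatedSVD(n_components=50, random_state=0)
X_dense = svd.fit_transform(X)

# Now the ball-tree back end works, since the data is dense.
nn = NearestNeighbors(n_neighbors=10, algorithm="ball_tree").fit(X_dense)
dist, idx = nn.kneighbors(X_dense)
```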

Answered Nov 13 '22 by Mathieu