 

Locality Sensitive Hashing of sparse numpy arrays

I have a large sparse numpy/scipy matrix where each row corresponds to a point in high-dimensional space. I want to make queries of the following kind:

Given a point P (a row in the matrix) and a distance epsilon, find all points with distance at most epsilon from P.

The distance I am using is the Jaccard distance (one minus the Jaccard similarity), so it should be possible to use Locality Sensitive Hashing tricks such as MinHash.
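To make the query concrete, this is roughly the brute-force version I am trying to speed up (a minimal sketch; the matrix, the point index and epsilon are made up for illustration):

import numpy as np
from scipy.sparse import csr_matrix

# Toy data: each row is a point, columns are binary features
X = csr_matrix(np.random.binomial(1, 0.01, size=(1000, 5000)))

def jaccard_distance(a, b):
    """Jaccard distance between two binary CSR rows."""
    a_idx, b_idx = set(a.indices), set(b.indices)
    union = len(a_idx | b_idx)
    return 0.0 if union == 0 else 1.0 - len(a_idx & b_idx) / float(union)

def brute_force_query(X, p, epsilon):
    """Indices of all rows within Jaccard distance epsilon of row p."""
    P = X.getrow(p)
    return [i for i in range(X.shape[0])
            if jaccard_distance(X.getrow(i), P) <= epsilon]

neighbours = brute_force_query(X, p=0, epsilon=0.7)

This is linear in the number of points per query, which is exactly what I would like LSH to avoid.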

Is there an implementation of MinHash for sparse numpy arrays somewhere (I can't seem to find one) or is there an easy way to do this?

The reason I am not just pulling something built for dense arrays off of GitHub is that feeding scipy's sparse data structures into such code might cause explosions in time complexity.

asked Jul 17 '14 by utdiscant



1 Answer

If your sparse dataset is too large to be held in memory in a dense format, I'd try out this LSH implementation, which is built around Scipy's CSR sparse matrices:

https://github.com/brandonrobertz/SparseLSH

It also has support for disk-based key-value stores like LevelDB if you can't fit the hash tables in memory (a hedged configuration sketch follows the example below). From the docs:

from sparselsh import LSH
from scipy.sparse import csr_matrix

X = csr_matrix([
    [3, 0, 0, 0, 0, 0, -1],
    [0, 1, 0, 0, 0, 0,  1],
    [1, 1, 1, 1, 1, 1,  1]])

# One class label for each input point
y = [0, 3, 10]

X_sim = csr_matrix([[1, 1, 1, 1, 1, 1, 0]])

lsh = LSH(4,                  # hash size in bits
          X.shape[1],         # input dimensionality
          num_hashtables=1,
          storage_config={"dict": None})

# Index each row of X together with its label
for ix in range(X.shape[0]):
    x = X.getrow(ix)
    c = y[ix]
    lsh.index(x, extra_data=c)

# find points similar to X_sim
lsh.query(X_sim, num_results=1)
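If the hash tables outgrow RAM, the same constructor can point at LevelDB-backed storage instead of the in-memory dict. I haven't verified the exact storage_config keys myself, so treat the "leveldb"/"db" names below as an assumption and check the project README:

from sparselsh import LSH

# Assumed config format for the LevelDB backend -- these keys are from memory,
# not verified against the current SparseLSH docs
lsh = LSH(4,
          X.shape[1],
          num_hashtables=1,
          storage_config={"leveldb": {"db": "lsh.db"}})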

If you definitely only want to use MinHash, you could try out https://github.com/go2starr/lshhdc, but I haven't personally tested that one out for compatibility with sparse matrices.
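If you would rather roll your own, MinHash itself is only a few lines once you treat the nonzero column indices of a CSR row as the set to be hashed. A minimal sketch in plain numpy/scipy; the (a*x + b) mod p hash family and the toy matrix are my own choices for illustration:

import numpy as np
from scipy.sparse import csr_matrix

def minhash_signature(row, num_hashes=64, prime=2147483647, seed=0):
    """MinHash signature of one binary CSR row (its nonzero columns are the set)."""
    rng = np.random.RandomState(seed)
    a = rng.randint(1, prime, size=num_hashes, dtype=np.int64)
    b = rng.randint(0, prime, size=num_hashes, dtype=np.int64)
    cols = row.indices.astype(np.int64)
    if cols.size == 0:                      # empty row: signature of sentinels
        return np.full(num_hashes, prime, dtype=np.int64)
    # h_i(x) = (a_i * x + b_i) mod prime, minimised over the set
    return ((a[:, None] * cols[None, :] + b[:, None]) % prime).min(axis=1)

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return np.mean(sig_a == sig_b)

X = csr_matrix([[1, 0, 1, 1, 0, 1],
                [1, 0, 1, 0, 0, 1],
                [0, 1, 0, 0, 1, 0]])

sigs = [minhash_signature(X.getrow(i)) for i in range(X.shape[0])]
print(estimated_jaccard(sigs[0], sigs[1]))   # close to 0.75 (true Jaccard is 3/4)
print(estimated_jaccard(sigs[0], sigs[2]))   # close to 0 (the rows are disjoint)

To turn that into LSH you would band the signatures and bucket points whose bands collide, which is essentially what the libraries above do for you.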

answered Oct 11 '22 by COOLZXxX