I have a large sparse numpy/scipy matrix where each row corresponds to a point in high-dimensional space. I want make queries of the following kind:
Given a point P (a row in the matrix) and a distance epsilon, find all points with distance at most epsilon from P.
The distance metric I am using is Jaccard-similarity, so it should be possible to use Locality Sensitive Hashing tricks such as MinHash.
Is there an implementation of MinHash for sparse numpy arrays somewhere (I can't seem to find one) or is there an easy way to do this?
The reason I am not just pulling something built for non-sparse arrays off of Github is that the sparse data structures in scipy might cause explosions in time complexity.
In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.)
LSH has many applications, including: Near-duplicate detection: LSH is commonly used to deduplicate large quantities of documents, webpages, and other files. Genome-wide association study: Biologists often use LSH to identify similar gene expressions in genome databases.
Why LSH? LSH can be considered an algorithm for dimensionality reduction. A problem that arises when we recommend items from large datasets is that there may be too many pairs of items to calculate the similarity of each pair. Also, we likely will have sparse amounts of overlapping data for all items.
If you have very large sparse datasets that are too large to be held in memory in a non-sparse format, I'd try out this LSH implementation that is built around the assumption of Scipy's CSR Sparse Matrices:
https://github.com/brandonrobertz/SparseLSH
It also hash support for disk-based key-value stores like LevelDB if you can't fit the tables in memory. From the docs:
from sparselsh import LSH
from scipy.sparse import csr_matrix
X = csr_matrix( [
[ 3, 0, 0, 0, 0, 0, -1],
[ 0, 1, 0, 0, 0, 0, 1],
[ 1, 1, 1, 1, 1, 1, 1] ])
# One class number for each input point
y = [ 0, 3, 10]
X_sim = csr_matrix( [ [ 1, 1, 1, 1, 1, 1, 0]])
lsh = LSH( 4,
X.shape[1],
num_hashtables=1,
storage_config={"dict":None})
for ix in xrange(X.shape[0]):
x = X.getrow(ix)
c = y[ix]
lsh.index( x, extra_data=c)
# find points similar to X_sim
lsh.query(X_sim, num_results=1)
If you definitely only want to use MinHash, you could try out https://github.com/go2starr/lshhdc, but I haven't personally tested that one out for compatibility with sparse matrices.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With