The code below causes my system to run out of memory before it completes.
Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below?
I would like to have the cosine similarity computed for each of the 65000 rows in my original matrix (mat) relative to all of the others, so that the result is a 65000 x 65000 matrix where each element is the cosine similarity between two rows in the original matrix.
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
mat = np.random.rand(65000, 10)
sparse_mat = sparse.csr_matrix(mat)
similarities = cosine_similarity(sparse_mat)
After running that last line I always run out of memory and the program either freezes or crashes with a MemoryError. This occurs whether I run it locally with 8 GB of RAM or on a 64 GB EC2 instance.
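For scale, a rough estimate of the output size alone (assuming a float64 result, as the code above would produce):
n = 65000
print(n * n * 8 / 1e9)  # ~33.8 GB just for the 65000 x 65000 result, before any intermediate copies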
Same problem here. I've got a big, non-sparse matrix. It fits in memory just fine, but cosine_similarity crashes anyway, probably because it makes one copy too many of the matrix somewhere internally. So I made it compare small batches of rows "on the left" instead of the entire matrix:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    ret = np.empty((m1.shape[0], m2.shape[0]))
    for row_i in range(0, int(m1.shape[0] / batch_size) + 1):
        start = row_i * batch_size
        end = min([(row_i + 1) * batch_size, m1.shape[0]])
        if end <= start:
            break  # cause I'm too lazy to elegantly handle edge cases
        rows = m1[start: end]
        sim = cosine_similarity(rows, m2)  # rows is O(1) size
        ret[start: end] = sim
    return ret
No crashes for me; YMMV. Try different batch sizes to make it faster. I used to only compare 1 row at a time, and it took about 30X as long on my machine.
Stupid yet effective sanity check:
import random
while True:
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))
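Applied to the matrix from the question, usage would look roughly like this (a sketch; batch_size=1000 is an arbitrary choice, and the full 65000 x 65000 float64 result still needs roughly 34 GB, so it will only fit on the larger machine):
mat = np.random.rand(65000, 10)
similarities = cosine_similarity_n_space(mat, mat, batch_size=1000)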
I would run it in chunks like this:
from sklearn.metrics.pairwise import cosine_similarity
# Change chunk_size to control resource consumption and speed
# Higher chunk_size means more memory/RAM needed but also faster
chunk_size = 500
matrix_len = your_matrix.shape[0]  # your_matrix is a dense numpy.ndarray here, not sparse

def similarity_cosine_by_chunk(start, end):
    if end > matrix_len:
        end = matrix_len
    return cosine_similarity(X=your_matrix[start:end], Y=your_matrix)  # scikit-learn function

for chunk_start in range(0, matrix_len, chunk_size):
    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # Handle cosine_similarity_chunk here, e.g. write it to a timestamped file and close the file
    # Do not keep the chunks around in memory, or you may still run out of memory after a few chunks
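A minimal sketch of the "handle" step, assuming each chunk is simply written straight to disk with numpy (similarity_cosine_by_chunk, chunk_size and your_matrix are the names from the snippet above; the file naming is just illustrative):
import numpy as np

for chunk_start in range(0, matrix_len, chunk_size):
    chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # Each block of rows goes to its own .npy file, so only one chunk lives in memory at a time
    np.save('cosine_chunk_{}.npy'.format(chunk_start), chunk)
    del chunk
The saved blocks can later be reassembled row-wise, or loaded and processed one at a time with np.load.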