Task
I have a pandas dataframe where:
I need to calculate a new matrix of doc1-doc2 similarity where:
The cosine distance is conveniently provided by scipy.spatial.distance.cosine.
I'm currently doing this:
Problem
But it takes a very, very long time. The following shows the current speed on a MacBook Pro 13 with 16 GB RAM and a 2.9 GHz i5 CPU, running the latest Anaconda Python 3.5, plotting the time taken against the number of combinations of docs.
You can see that 100,000 combinations take 1200 seconds. Extrapolating that to my corpus of 7944 documents, which creates 31,549,596 combinations, gives (31,549,596 / 100,000) * 1200 s ≈ 378,600 s, i.e. roughly 4 to 5 days to calculate this similarity matrix!
Any ideas?
Here's the code (github)
import itertools
import collections
import pandas
import scipy.spatial.distance

doc_similarity_dict = collections.defaultdict(dict)   # results container (initialisation not shown in the original excerpt)
docs_combinations = itertools.combinations(docs_sample, 2)
for doc1, doc2 in docs_combinations:
    # scipy's cosine includes normalising the vectors but is a distance, so take it from 1.0 to get a similarity
    doc_similarity_dict[doc2][doc1] = 1.0 - scipy.spatial.distance.cosine(
        relevance_index[doc1], relevance_index[doc2])

# convert dict to pandas dataframe
doc_similarity_matrix = pandas.DataFrame(doc_similarity_dict)
Simple Example
@MaxU asked for an illustrative example.
Relevance matrix (wordcount here, just to keep it simple):
...      doc1  doc2  doc3
wheel     2.    3.    0.
seat      2.    2.    0.
lights    0.    1.    1.
cake      0.    0.    5.
Calculated similarity matrix based on the 2-combinations (doc1, doc2), (doc2, doc3), (doc1, doc3):
...     doc2    doc3
doc1    0.9449  0.
doc2    -       0.052
Take that top-left value 0.9449: that's the dot product (2*3 + 2*2 + 0 + 0) = 10, but normalised by the lengths of the vectors, so dividing by sqrt(8) and sqrt(14) gives 0.9449. You can see that there is no similarity between doc1 and doc3: the dot product is zero.
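The same number falls out of a quick numpy check (a minimal sketch; the two vectors are just the doc1 and doc2 columns of the word-count table above):

import numpy as np

doc1 = np.array([2., 2., 0., 0.])   # wheel, seat, lights, cake counts
doc2 = np.array([3., 2., 1., 0.])

# dot product divided by the product of the vector lengths
print(doc1.dot(doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2)))   # ~0.9449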
Scale this from 3 documents with 4 words ... to 7944 documents, which creates 31,549,596 combinations ...
This is about as efficient as I can make an algorithm without moving into multiprocessing (bleh). The function uses numpy arrays for all of the calculations.
import numpy as np

def cos_sim(data_frame):
    # create a numpy array from the data frame (words x docs)
    a = data_frame.values
    # get the number of documents
    n = a.shape[-1]
    # create a flat array of size docs x docs to populate
    out = np.ravel(np.zeros(shape=(n, n)))
    for i in range(n):
        # roll the array one step at a time, calculating the cosine similarity each time
        r = np.roll(a, -i, axis=1)
        cs = np.sum(a[:, :n-i] * r[:, :n-i], axis=0) / (
             np.sqrt(np.sum(a[:, :n-i] * a[:, :n-i], axis=0))
             * np.sqrt(np.sum(r[:, :n-i] * r[:, :n-i], axis=0)))
        # push the cosine similarity to the output array's i-th off-diagonal
        out[i:n*n - i*n:n+1] = cs
    return out.reshape((n, n))
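For example, calling it on the small relevance matrix from the question (a throwaway DataFrame built here just for illustration):

import pandas as pd

df = pd.DataFrame({'doc1': [2., 2., 0., 0.],
                   'doc2': [3., 2., 1., 0.],
                   'doc3': [0., 0., 1., 5.]},
                  index=['wheel', 'seat', 'lights', 'cake'])

sim = cos_sim(df)
# sim[0, 1] is the doc1-doc2 similarity (~0.9449); the diagonal is all 1.0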
Numba will be a good solution for this. As I think you know, it does not support Pandas DataFrames, but it is built around NumPy arrays. This is not a problem: you can easily and quickly convert your DataFrame to a 2D array and pass that to a Numba function (which will be pretty much the code you already have, just decorated with @njit at the top).
Also note that instead of a dict-of-dicts for the results, you can use one triangle of a square matrix to store them (a sketch follows the illustration below):
       doc1  doc2  doc3
doc1   NAN   NAN   NAN
doc2   ...   NAN   NAN
doc3   ...   ...   NAN
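A minimal sketch of both ideas together, a plain NumPy array in and one triangle of a square matrix out (the function name and the words-as-rows / docs-as-columns layout are assumptions for illustration, not your actual code):

import numpy as np
from numba import njit

@njit
def similarity_numba(m):
    # m: 2D float array with one column per document (e.g. df.values)
    n_words, n_docs = m.shape
    out = np.full((n_docs, n_docs), np.nan)
    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            dot = 0.0
            norm_i = 0.0
            norm_j = 0.0
            for k in range(n_words):
                dot += m[k, i] * m[k, j]
                norm_i += m[k, i] * m[k, i]
                norm_j += m[k, j] * m[k, j]
            # fill the lower triangle only; the rest stays NaN
            out[j, i] = dot / (np.sqrt(norm_i) * np.sqrt(norm_j))
    return out

# usage, with df holding the words x docs relevance matrix from the question:
# sim = similarity_numba(df.values)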
Edit: You've now implemented it using Numba, but got only a 2.5x speedup. I ran some experiments and found a big win:
In [66]: x = np.random.random((1000,1000))
In [67]: y = np.array(x, order='F')
In [68]: %timeit similarity_jit(x)
1 loop, best of 3: 13.7 s per loop
In [69]: %timeit similarity_jit(y)
1 loop, best of 3: 433 ms per loop
That is, your algorithm will be much, much faster if you operate on contiguous chunks of data, due to caching. Since the kernel of your algorithm is numpy.dot(m[:, i], m[:, j]), and m[:, i] takes one column, you are better off orienting your data in "Fortran order" (column-major order) first, so that m[:, i] gives one contiguous array (because the array is laid out "transposed" in memory).
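A minimal sketch of that re-ordering (df stands for the words x docs DataFrame from the question; np.asfortranarray is just one way to get a column-major copy):

import numpy as np

m = df.values                 # default C order: rows are contiguous
m_f = np.asfortranarray(m)    # column-major copy, equivalent to np.array(m, order='F')
sim = similarity_jit(m_f)     # each column m_f[:, i] is now one contiguous block in memory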