efficient way to calculate distance between combinations of pandas frame columns

Task

I have a pandas dataframe where:

  • the columns are document names
  • the rows are words in those documents
  • numbers inside the frame cells are a measure of word relevance (word count if you want to keep it simple)

I need to calculate a new matrix of doc-doc similarity where:

  • rows and columns are document names
  • the cells inside the frame are a similarity measure, (1 - cosine distance), between the two documents

The cosine distance is conveniently provided by scipy.spatial.distance.cosine.
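For illustration, here is a minimal sketch of that per-pair measure, using the word-count vectors from the simple example further down (the variable names are mine):

import scipy.spatial.distance

doc1 = [2.0, 2.0, 0.0, 0.0]   # wheel, seat, lights, cake
doc2 = [3.0, 2.0, 1.0, 0.0]

# cosine() returns a distance, so subtract from 1.0 to get a similarity
similarity = 1.0 - scipy.spatial.distance.cosine(doc1, doc2)
print(similarity)   # ~0.9449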

I'm currently doing this:

  1. use itertools to create a list of all 2-combinations of the document names (dataframe columns names)
  2. loop over these and update a dictionary of {doc1: {doc2: similarity}}
  3. after the loop, create a new frame using pandas.DataFrame(dict)

Problem

But it takes a very, very long time. The following shows the current speed on a MacBook Pro 13 with 16GB RAM and a 2.9GHz i5 CPU, running the latest Anaconda Python 3.5, plotting the time taken against the number of document combinations.

[plot: distance calculation performance, time taken vs. number of document combinations]

You can see that 100,000 combinations take about 1200 seconds. Extrapolating that to my corpus of 7944 documents, which gives 7944 × 7943 / 2 = 31,549,596 combinations, it would take about 5 days to calculate this similarity matrix!

Any ideas?

  • I was previously building the dataframe dynamically with df.ix[doc1,doc2] = similarity, which was very much slower.
  • I've considered Numba's @jit but it fails with pandas data structures.
  • I can't find a built-in function which will do all the work internally (in C?).
  • As a tactical workaround I randomly sample the documents to create a much smaller set to work with ... currently a fraction of 0.02 leads to about 20 minutes of calculation (a sketch of this sampling step follows this list)!
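Here is a minimal sketch of that sampling step (the max(2, ...) guard is my own addition):

import random

# documents are the columns of relevance_index; sample 2% of them
doc_names = list(relevance_index.columns)
docs_sample = random.sample(doc_names, max(2, int(0.02 * len(doc_names))))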

Here's the code (github)

import collections
import itertools
import pandas
import scipy.spatial.distance

# results as {doc2: {doc1: similarity}}; a defaultdict avoids missing-key errors
doc_similarity_dict = collections.defaultdict(dict)

docs_combinations = itertools.combinations(docs_sample, 2)
for doc1, doc2 in docs_combinations:
    # scipy's cosine() normalises the vectors but returns a distance .. so we take it from 1.0
    doc_similarity_dict[doc2][doc1] = 1.0 - scipy.spatial.distance.cosine(relevance_index[doc1], relevance_index[doc2])

# convert dict to pandas dataframe
doc_similarity_matrix = pandas.DataFrame(doc_similarity_dict)

Simple Example

@MaxU asked for an illustrative example.

Relevance matrix (wordcount here, just to keep it simple):

...     doc1 doc2 doc3
wheel   2.   3.   0.
seat    2.   2.   0.
lights  0.   1.   1.
cake    0.   0.   5.

Calculated similarity matrix based on the 2-combinations (doc1, doc2), (doc2, doc3), (doc1, doc3):

...     doc2 doc3
doc1    0.9449  0.
doc2    -       0.052

Take that top-left value of 0.9449 .. that's the dot product (2*3 + 2*2 + 0 + 0) = 10, but normalised by the lengths of the vectors .. so dividing by sqrt(8) and sqrt(14) gives 0.9449. You can see that there is no similarity between doc1 and doc3 .. the dot product is zero.
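A quick numerical check of that calculation (a sketch using numpy; the variable names are mine):

import numpy as np

d1 = np.array([2.0, 2.0, 0.0, 0.0])   # doc1: wheel, seat, lights, cake
d2 = np.array([3.0, 2.0, 1.0, 0.0])   # doc2

dot = d1 @ d2                                        # 10.0
similarity = dot / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(similarity)                                    # 0.9449...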

Scale this from 3 documents with 4 words ... to 7944 documents, which creates 31,549,596 combinations ...

asked Nov 16 '16 by MYO TextMiningToolkit


2 Answers

This is about as efficient as I can make an algorithm without moving into multiprocessing (bleh). The function uses numpy arrays for all of the calculations.

import numpy as np

def cos_sim(data_frame):
    # create a numpy array from the data frame
    a = data_frame.values
    # get the number of documents
    n = a.shape[-1]
    # create an array of size docs x docs to populate
    out = np.ravel(np.zeros(shape=(n, n)))

    for i in range(n):
        # roll the array one step at a time, calculating the cosine similarity each time
        r = np.roll(a, -i, axis=1)
        cs = np.sum(a[:,:n-i]*r[:,:n-i], axis=0) / (
                np.sqrt(np.sum(a[:,:n-i]*a[:,:n-i], axis=0))
                *np.sqrt(np.sum(r[:,:n-i]*r[:,:n-i], axis=0)))

        # push the cosine similarity to the output array's i-th off-diagonal
        out[i:n*n-i*n:n+1] = cs

    return out.reshape((n,n))
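One way to call this and get a labelled frame back (a sketch; relevance_index is the questioner's relevance DataFrame, and note that only the diagonal and one triangle of the output are populated, the rest stays 0):

import pandas

sim = cos_sim(relevance_index)
doc_similarity_matrix = pandas.DataFrame(
    sim, index=relevance_index.columns, columns=relevance_index.columns)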
answered Oct 06 '22 by James


Numba will be a good solution for this. As I think you know, it does not support Pandas DataFrames, but it is built around NumPy arrays. This is not a problem: you can easily and quickly convert your DataFrame to a 2D array and pass that to a Numba function (which will be pretty much the code you already have, just decorated with @njit at the top).

Also note that instead of a dict-of-dicts for the results, you can use one triangle of a square matrix to store them:

     doc1 doc2 doc3
doc1  NAN  NAN  NAN
doc2  ...  NAN  NAN
doc3  ...  ...  NAN
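A minimal sketch of that idea, i.e. a plain nested loop over a NumPy array decorated with @njit, writing into one triangle (the function and variable names are mine, not from the question's code):

import numpy as np
from numba import njit

@njit
def similarity_lower_triangle(m):
    # m: 2-D float array of shape (n_words, n_docs); documents are columns
    n_words, n_docs = m.shape
    out = np.full((n_docs, n_docs), np.nan)
    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            dot = 0.0
            norm_i = 0.0
            norm_j = 0.0
            for k in range(n_words):
                dot += m[k, i] * m[k, j]
                norm_i += m[k, i] * m[k, i]
                norm_j += m[k, j] * m[k, j]
            # fill only the lower triangle, as in the layout above
            out[j, i] = dot / np.sqrt(norm_i * norm_j)
    return out

# example call: sim = similarity_lower_triangle(relevance_index.values.astype(np.float64))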

Edit: You've now implemented it using Numba, but got only a 2.5x speedup. I ran some experiments and found a big win:

In [66]: x = np.random.random((1000,1000))

In [67]: y = np.array(x, order='F')

In [68]: %timeit similarity_jit(x)
1 loop, best of 3: 13.7 s per loop

In [69]: %timeit similarity_jit(y)
1 loop, best of 3: 433 ms per loop

That is, your algorithm will be much, much faster if you operate on contiguous chunks of data, due to caching. Since the kernel of your algorithm is numpy.dot(m[:,i], m[:,j]), and m[:,i] takes one column, you are better off orienting your data in "Fortran order" (column-major order) first, so that m[:,i] gives one contiguous array (because the array is laid out "transposed" in memory).
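Concretely, the reordering could look like this (a sketch; relevance_index is the questioner's DataFrame and similarity_jit the function timed above):

import numpy as np

# column-major copy, so each document column m[:, i] is contiguous in memory
m = np.asfortranarray(relevance_index.values, dtype=np.float64)
sim = similarity_jit(m)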

answered Oct 06 '22 by John Zwinck