Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python, Scipy: Building triplets using large adjacency matrix

I am using an adjacency matrix to represent a network of friends which can be visually interpreted as

Mary     0        1      1      1

Joe      1        0      1      1

Bob      1        1      0      1

Susan    1        1      1      0 

         Mary     Joe    Bob    Susan

Using this matrix, I want to compile a list of all possible friendship triangles with the condition that user 1 is friends with user 2, and user 2 is friends with user 3. For my list, it is not required that user 1 is friends with user 3.

(joe, mary, bob)
(joe, mary, susan)
(bob, mary, susan)
(bob, joe, susan)

I have a bit of code that works well with small triangles, but I need it to scale for very large sparse matrices.

from numpy import *
from scipy import *

def buildTriangles(G):
    # G is a sparse adjacency matrix
    start = time.time()
    ctr = 0
    G = G + G.T          # I do this to make sure it is symmetric
    triples = []
    for i in arange(G.shape[0] - 1):  # for each row but the last one
        J,J = G[i,:].nonzero()        # J: primary friends of user i
                                      # I do J,J because I do not care about the row values
        J = J[ J < i ]                # only computer the lower triangle to avoid repetition
        for j in J:
            K, buff = G[:,j].nonzero() # K: secondary friends of user i
            K = K[ K > i ]             # only compute below i to avoid repetition
            for k in K:
                ctr = ctr + 1
                triples.append( (i,j,k) )
    print("total number of triples: %d" % ctr)
    print("run time is %.2f" % (time.time() - start())
    return triples

I was able to run the code on a csr_matrix in approximately 21 minutes. The matrix was 1032570 x 1032570 and contained 88910 stored elements. There were a total of 2178893 triplets generated.

I need to be able to do something similar with a 1968654 x 1968654 sparse matrix with 9428596 stored elements.

I'm very new to python (little less than a month of experience) and not the greatest at linear algebra, which is why my code does not take advantage of matrices operations. Can anyone make any suggestions for improvement or let me know if my objective is even realistic?

like image 337
will Avatar asked Aug 03 '11 19:08

will


2 Answers

I think you can find triangles only in rows or columns. for example:

Susan    1        1      1      0 
        Mary     Joe    Bob    Susan

this means Mary, Joe, Bob are all friends of Susan, so, use combinations to choose two person from [Mary, Joe, Bob], and combine it with Susan will get one triangle. itertools.combinations() do this quickly.

Here is the code:

import itertools
import numpy as np

G = np.array(   # clear half of the matrix first
    [[0,0,0,0],
     [1,0,0,0],
     [1,1,0,0],
     [1,1,1,0]])
triples = []     
for i in xrange(G.shape[0]):
    row = G[i,:]
    J = np.nonzero(row)[0].tolist() # combinations() with list is faster than NumPy array.
    for t1,t2 in itertools.combinations(J, 2):
        triples.append((i,t1,t2))
print triples
like image 198
HYRY Avatar answered Nov 05 '22 01:11

HYRY


Here's some suggestions for optimization:

K = K[ K > i ]             # only compute below i to avoid repetition
for k in K:
    ctr = ctr + 1
    triples.append( (i,j,k) )

Don't increment in a loop, it's terribly slow. Just ctr += K.shape[0] will do. Then, eliminate the most deeply nested loop altogether by replacing the append with

triples += ((i, j, k) for k in K[K > i])

Now, if you want real performance on this task, you will have to get into some linear algebra. "I want to compile a list of all possible friendship triangles" means that you want to square the adjacency matrix, which you can do with a simple **2.

Then realize that 1.968.654² means a very big matrix, and even though it's very sparse, its square will be much less so and will take a lot of memory. (I once tackled a similar problem where I considered links between Wikipedia articles at distance two, which took 20 minutes to solve, on a supercomputer cluster node, in C++. This is not a trivial problem. The Wikipedia adjacency matrix was a few orders of magnitude denser, though.)

like image 23
Fred Foo Avatar answered Nov 05 '22 01:11

Fred Foo