Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I cluster a graph in Python?

Let G be a graph. So G is a set of nodes and set of links. I need to find a fast way to partition the graph. The graph I am now working has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.

So, what I did was to store all the links into a graph matrix:

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

Now M holds a 1 in position s,t, if node s is connected to node t. I make sure M is symmetrical M[s,t]=M[t,s] and each node links to itself M[s,s]=1.

If I remember well if I multiply M with M, the results is a matrix that represents the graph that connects vertexes that are reached on through two steps.

So I keep on multplying M with itself, until the number of zeros in the matrix do not decrease any longer. Now I have the list of the connected components. And now I need to cluster this matrix.

Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. I am having trouble with this part.

Essentially I need to split this graph into its connected components.

I can go through all the nodes, and see what are they connected to.

But what about sorting the matrix reordering the lines. But I don't know if it is possible to do it.

What follows is the code so far:

def findzeros(M):
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))    
for s in data.keys():
    MatrixCells[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)

Edit:

It has been suggested that I use SVD decomposition. Here is a simple example of the problem on a 5x5 graph. We shall use this since with the 19200x19200 square matrix is not that easy to see the clusters.

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh

Essentially there are 4 clusters here: (0),(1,3),(2),(4) But I still don't see how the svn can help in this context.

like image 938
Pietro Speroni Avatar asked Mar 17 '09 09:03

Pietro Speroni


People also ask

What is clustering on a graph?

Graph clustering is to group the vertices of a graph into clusters based on the graph structure and/or node attributes. Various works (Zhang et al., 2019c) in node representation learning are developed and the representation of nodes can be passed to traditional clustering algorithms.

How do I visualize clusters in Kmeans?

This template helps you identify and visualize clusters in your data. The K Means algorithm aims to assign the data points in your dataset to K distinct clusters. After running the algorithm each observation (data point) will belong to the cluster whose center it is closest to.

How do you visualize a cluster?

To visualize the clusters you can use one of the most popular methods for dimensionality reduction, namely PCA and t-SNE.


2 Answers

Why not use a real graph library, like Python-Graph? It has a function to determine connected components (though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.

EDIT: NetworkX seems like it might be a better choice than python-graph; its documentation (here for the connected components function) certainly is.

like image 93
kquinn Avatar answered Oct 22 '22 20:10

kquinn


In SciPy you can use sparse matrices. Also note, that there are more efficient ways of multiplying matrix by itself. Anyway, what you're trying to do can by done by SVD decomposition.

Introduction with useful links.

like image 23
vartec Avatar answered Oct 22 '22 18:10

vartec