 

Efficiently generating random graphs with a user-specified global clustering coefficient

I'm working on simulations of large-scale neuronal networks, for which I need to generate random graphs that represent the network topology.

I'd like to be able to specify the following properties of these graphs:

  • Number of nodes, N (~1000-10000)
  • Average probability of a connection between any two given nodes, p (~0.01-0.2)
  • Global clustering coefficient, C (~0.1-0.5)

Ideally, the random graphs should be drawn uniformly from the set of all possible graphs that satisfy these user-specified criteria.

At the moment I'm using a very crude random diffusion approach: I start with an Erdos-Renyi random network of the desired size and global connection probability, then on each step I randomly rewire some fraction of the edges. If the rewiring brings the graph closer to the desired C, I carry the rewired network into the next iteration.

Here's my current Python implementation:

import igraph
import numpy as np

def generate_fixed_gcc(n, p, target_gcc, tol=1E-3):
    """
    Creates an Erdos-Renyi random graph of size n with a specified global
    connection probability p, which is then iteratively rewired in order to
    achieve a user-specified global clustering coefficient.
    """

    # initialize random graph
    G_best = igraph.Graph.Erdos_Renyi(n=n, p=p, directed=True, loops=False)

    loss_best = 1.
    n_edges = G_best.ecount()

    # start with a high rewiring rate
    rewiring_rate = n_edges
    n_iter = 0

    while loss_best > tol:

        # operate on a copy of the current best graph
        G = G_best.copy()

        # adjust the number of connections to rewire according to the current
        # best loss
        n_rewire = min(max(int(rewiring_rate * loss_best), 1), n_edges)
        G.rewire(n=n_rewire)

        # compute the global clustering coefficient
        gcc = G.transitivity_undirected()
        loss = abs(gcc - target_gcc)

        # did we improve?
        if loss < loss_best:

            # keep the new graph
            G_best = G
            loss_best = loss
            gcc_best = gcc

            # increase the rewiring rate
            rewiring_rate *= 1.1

        else:

            # reduce the rewiring rate
            rewiring_rate *= 0.9

        n_iter += 1

    # get adjacency matrix as a boolean numpy array
    M = np.array(G_best.get_adjacency().data, dtype=bool)

    return M, n_iter, gcc_best
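For reference, a typical call looks like this (the parameter values are just examples from the ranges above):

M, n_iter, gcc = generate_fixed_gcc(n=200, p=0.1, target_gcc=0.3)
print(M.shape, n_iter, gcc)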

This works OK for small networks (N < 500), but it quickly becomes intractable as the number of nodes increases. It takes on the order of 20 seconds to generate a 200-node graph, and several days to generate a 1000-node graph.

Can anyone suggest an efficient way to do this?

asked Dec 17 '14 by ali_m

2 Answers

You are right, that is a very expensive way to achieve what you want. I can only speculate about whether there is a mathematically sound way to optimize it while ensuring the result stays close to a uniform distribution; I'm not even sure that your method yields a uniform distribution, although it seems like it would. Let me try:

Based on the docs for transitivity_undirected and the Wikipedia article on the clustering coefficient, it sounds like it is possible to make changes locally in the graph while knowing the exact effect on global connectivity and global clustering.

The global clustering coefficient is based on triplets of nodes. A triplet consists of three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle consists of three closed triplets, one centred on each of the nodes. The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed).
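Restated as a formula, C = 3 × (number of triangles) / (number of connected triplets), where the connected triplets are the paths of length two. Here's a quick sanity check of that identity with python-igraph (my own addition, not from the question):

import igraph

# connected triplets = paths of length two = sum of d*(d-1)/2 over all degrees
g = igraph.Graph.Erdos_Renyi(n=100, p=0.1)
n_triangles = len(g.cliques(min=3, max=3))
n_triplets = sum(d * (d - 1) // 2 for d in g.degree())

# the two numbers should agree
print(g.transitivity_undirected(), 3 * n_triangles / n_triplets)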


(*edit*) Based on my reading of the paper referenced by ali_m, the method below will probably spend too many edges on low-degree clusters, producing a graph that cannot reach the desired clustering coefficient unless that coefficient is very low (which probably wouldn't be useful anyway). So, on the off chance that somebody actually uses this, you will want to add edges preferentially around higher-degree nodes, which raises the clustering coefficient quickly without spending a lot of edges.

On the other hand, the method below does align with the methods in the research paper so it's more or less a reasonable approach.


If I understand it correctly, you could do the following:

  1. Produce the graph as you have done.

  2. Calculate and Track:

    • p_surplus to track the number of edges that need to be added or removed elsewhere to maintain connectivity
    • cc_top, cc_btm to track the clustering coefficient
  3. Iteratively (but not exhaustively) choose random pairs and connect or disconnect them to monotonically approach the clustering coefficient (cc) you want, while maintaining the connectivity (p) you already have (a concrete sketch of this bookkeeping follows below).

    Pseudo code:

    for random_pair in random_pairs:
    
        if (random_pair is connected) and (need to reduce cc or p):  # maybe put priority on the one that has a larger gap?
            delete the edge
            p_surplus -= 1
            cc_top -= broken_connected_triplets  # have to search locally
            cc_btm -= (broken_connected_triplets + broken_open_triplets)  # have to search locally
        elif (random_pair is not connected) and (need to increase cc or p):
            add the edge
            p_surplus += 1
            cc_top += new_connected_triplets
            cc_btm += (new_connected_triplets + new_open_triplets)
    
        if cc and p are within desired ranges:
            done
    
        if some condition for detecting infinite loops:
            rethink this method
    

That may not be totally correct, but I think the approach will work. Searching for triplets locally and always moving your parameters in the right direction will be far more efficient than copying the graph and measuring the cc globally so many times.
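Here is a minimal sketch of that bookkeeping for an undirected graph stored as adjacency sets. The names (adj, counters, cc_top, cc_btm) are my own: cc_top counts closed triplets and cc_btm counts all connected triplets (paths of length two), so the global clustering coefficient is cc_top / cc_btm.

def add_edge(adj, u, v, counters):
    # each common neighbour closes one triangle, i.e. 3 closed triplets
    common = len(adj[u] & adj[v])
    counters['cc_top'] += 3 * common
    # the new edge creates one new 2-path per existing neighbour of u and of v
    counters['cc_btm'] += len(adj[u]) + len(adj[v])
    adj[u].add(v)
    adj[v].add(u)

def remove_edge(adj, u, v, counters):
    # exact inverse of add_edge
    adj[u].discard(v)
    adj[v].discard(u)
    common = len(adj[u] & adj[v])
    counters['cc_top'] -= 3 * common
    counters['cc_btm'] -= len(adj[u]) + len(adj[v])

def gcc(counters):
    # closed triplets over all connected triplets
    return counters['cc_top'] / counters['cc_btm'] if counters['cc_btm'] else 0.0

Each update only touches the neighbourhoods of u and v (one set intersection), so there is no need to re-measure the whole graph after every change.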

answered Nov 15 '22 by KobeJohn


Having done a bit of reading, it looks as though the best solution might be the generalized version of Gleeson's algorithm presented in this paper. However, I still don't really understand how to implement it, so for the time being I've been working with Bansal et al.'s algorithm.

Like my naive approach, this is a Markov chain-based method that uses random edge swaps, but unlike mine it specifically targets 'triplet motifs' within the graph for rewiring:

[Figure: the triplet-motif edge swap used in the rewiring step]

Since this has a greater tendency to introduce triangles, it has a greater impact on the clustering coefficient. At least in the case of undirected graphs, the rewiring step is also guaranteed to preserve the degree sequence. As before, the new global clustering coefficient is measured on every rewiring iteration, and the new graph is accepted if the GCC moved closer to the target value.
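For illustration, here is a rough sketch of one such triangle-closing, degree-preserving swap for an undirected igraph graph. This is my reading of the motif-targeting idea, not Bansal et al.'s actual procedure: pick an open triplet y1-x-y2, then swap edges (y1, a) and (y2, b) for (y1, y2) and (a, b), which closes the triangle x-y1-y2 without changing any degree.

import random
import igraph

def triangle_closing_swap(G):
    # pick a random open triplet y1 - x - y2
    x = random.randrange(G.vcount())
    nbrs = G.neighbors(x)
    if len(nbrs) < 2:
        return False
    y1, y2 = random.sample(nbrs, 2)
    if G.are_connected(y1, y2):
        return False  # triplet is already closed
    # sacrifice one edge at y1 and one at y2 so every degree is preserved
    a = random.choice(G.neighbors(y1))
    b = random.choice(G.neighbors(y2))
    if len({x, y1, y2, a, b}) < 5 or G.are_connected(a, b):
        return False  # would create a self-loop or multi-edge
    G.delete_edges([G.get_eid(y1, a), G.get_eid(y2, b)])
    G.add_edges([(y1, y2), (a, b)])
    return True  # the triangle x - y1 - y2 is now closed

Each successful swap adds at least one triangle, so wrapping this in the same accept/reject loop as before should push the GCC upward much faster than blind rewiring.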

Bansal et al. actually provided a Python implementation, but for various reasons I ended up writing my own version, which you can find here.

Performance

The Bansal approach takes just over half the number of iterations and half the total time compared with my naive diffusion method:

[Figure: number of iterations and total time for the Bansal method compared with the naive diffusion method]

I was hoping for bigger gains, but a 2x speedup is better than nothing.

Generalizing to directed graphs

One remaining challenge with the Bansal method is that my graphs are directed, whereas Bansal et al.'s algorithm is only designed to work on undirected graphs. With a directed graph, the rewiring step is no longer guaranteed to preserve the in- and out-degree sequences.


Update

I've just figured out how to generalize the Bansal method to preserve both the in- and out-degree sequences for directed graphs. The trick is to select motifs where the two outward edges to be swapped have opposite directions (the directions of the edges between {x, y1} and {x, y2} don't matter):

[Figure: directed triplet motif in which the two edges to be swapped point in opposite directions]
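As a sketch of why this works (my own illustration, not the implementation linked above): deleting an out-edge y1 -> a and an in-edge b -> y2, then adding y1 -> y2 and b -> a, leaves every node's in- and out-degree unchanged.

import random
import igraph

def directed_triangle_closing_swap(G):
    # pick a random triplet y1 - x - y2 (edge directions at x don't matter)
    x = random.randrange(G.vcount())
    nbrs = list(set(G.neighbors(x, mode="all")))
    if len(nbrs) < 2:
        return False
    y1, y2 = random.sample(nbrs, 2)
    out_y1 = G.successors(y1)   # the edge removed at y1 must point outward...
    in_y2 = G.predecessors(y2)  # ...and the edge removed at y2 must point inward
    if not out_y1 or not in_y2:
        return False
    a = random.choice(out_y1)
    b = random.choice(in_y2)
    if len({x, y1, y2, a, b}) < 5:
        return False  # degenerate pick, skip
    if G.are_connected(y1, y2) or G.are_connected(b, a):
        return False  # would create a multi-edge
    # out-degrees of y1 and b, and in-degrees of y2 and a, are all preserved
    G.delete_edges([G.get_eid(y1, a), G.get_eid(b, y2)])
    G.add_edges([(y1, y2), (b, a)])
    return True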

I've also made some more optimizations, and the performance is starting to look a bit more respectable - it takes roughly half the number of iterations and half the total time compared with the diffusion approach. I've updated the graphs above with the new timings.

answered Nov 15 '22 by ali_m