Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clauset-Newman-Moore community detection implementation

I am trying to implement the above community detection algorithm in Java, and while I have access to C++ code, and the original paper - I can't make it work at all. My major issue is that I don't understand the purpose of the code - i.e. how the algorithm works. In practical terms, my code gets stuck in what seems to be an infinite loop at mergeBestQ, the list heap seems to be getting larger on each iteration (as I would expect from the code), but the value of topQ is always returning the same value.

The graph I am testing this on is quite large (300,000 nodes, 650,000 edges). The original code I am using for my implementation is from the SNAP library (https://github.com/snap-stanford/snap/blob/master/snap-core/cmty.cpp). What would be great is if someone could explain to me the intuition of the algorithm, it seems to be initially setting each node to be in it's own community, then recording the modularity value (whatever that is) of each pair of connected nodes in the graph, then finding the pair of nodes which have the highest modularity and moving them to the same community. In addition, if someone could provide some mid level pseudo code, that would be great. Here is my implementation thus far, I have tried to keep it in one file for the sake of brevity, however CommunityGraph and CommunityNode are elsewhere (should not be required). Graph maintain a list of all nodes and each node maintains a list of its connections to other nodes. When running it never gets past the line while(this.mergeBestQ()){}

UPDATE - found several bugs in my code after a thorough review. The code now completes VERY quickly, but doesnt fully implement the algorithm, for example of the 300,000 nodes in the graph, it states there are approximately 299,000 communities (i.e. roughly 1 node per community). I have listed the updated code below. /// Clauset-Newman-Moore community detection method. /// At every step two communities that contribute maximum positive value to global modularity are merged. /// See: Finding community structure in very large networks, A. Clauset, M.E.J. Newman, C. Moore, 2004 public class CNMMCommunityMetric implements CommunityMetric{ private static class DoubleIntInt implements Comparable{ public double val1; public int val2; public int val3; DoubleIntInt(double val1, int val2, int val3){ this.val1 = val1; this.val2 = val2; this.val3 = val3; }

    @Override
    public int compareTo(DoubleIntInt o) {
      //int this_sum = this.val2 + this.val3;
      //int oth_sum = o.val2 + o.val3;
      if(this.equals(o)){
        return 0;
      }
      else if(val1 < o.val1 || (val1 == o.val1 && val2 < o.val2) || (val1 == o.val1 && val2 == o.val2 && val3 < o.val3)){
        return 1;
      }
      else{
        return -1;
      }
      //return this.val1 < o.val1 ? 1 : (this.val1 > o.val1 ? -1 : this_sum - oth_sum);
    }

    @Override
    public boolean equals(Object o){
      return this.val2 == ((DoubleIntInt)o).val2 && this.val3 == ((DoubleIntInt)o).val3;
    }

    @Override
    public int hashCode() {
      int hash = 3;
      hash = 79 * hash + this.val2;
      hash = 79 * hash + this.val3;
      return hash;
    }
  }

  private static class CommunityData {
    double DegFrac;
    TIntDoubleHashMap nodeToQ = new TIntDoubleHashMap();
    int maxQId;

    CommunityData(){
      maxQId = -1;
    }

    CommunityData(double nodeDegFrac, int outDeg){
      DegFrac = nodeDegFrac;
      maxQId = -1;
    }

    void addQ(int NId, double Q) { 
      nodeToQ.put(NId, Q);
      if (maxQId == -1 || nodeToQ.get(maxQId) < Q) { 
        maxQId = NId;
      } 
    }

    void updateMaxQ() { 
      maxQId=-1; 
      int[] nodeIDs = nodeToQ.keys();
      double maxQ = nodeToQ.get(maxQId);
      for(int i = 0; i < nodeIDs.length; i++){
        int id = nodeIDs[i];
        if(maxQId == -1 || maxQ < nodeToQ.get(id)){
          maxQId = id;
          maxQ = nodeToQ.get(maxQId);
        }
      } 
    }

    void delLink(int K) { 
      int NId=getMxQNId(); 
      nodeToQ.remove(K); 
      if (NId == K) { 
        updateMaxQ(); 
      }  
    }

    int getMxQNId() { 
      return maxQId;
    }

    double getMxQ() {
      return nodeToQ.get(maxQId); 
    }
  };
  private TIntObjectHashMap<CommunityData> communityData = new TIntObjectHashMap<CommunityData>();
  private TreeSet<DoubleIntInt> heap = new TreeSet<DoubleIntInt>();
  private HashMap<DoubleIntInt,DoubleIntInt> set = new HashMap<DoubleIntInt,DoubleIntInt>();
  private double Q = 0.0;
  private UnionFind uf = new UnionFind();
  @Override
  public double getCommunities(CommunityGraph graph) {
    init(graph);
    //CNMMCommunityMetric metric = new CNMMCommunityMetric();
    //metric.getCommunities(graph);
    // maximize modularity
    while (this.mergeBestQ(graph)) {
    }
    // reconstruct communities
    HashMap<Integer, ArrayList<Integer>> IdCmtyH = new HashMap<Integer, ArrayList<Integer>>();
    Iterator<CommunityNode> ns = graph.getNodes();
    int community = 0;
    TIntIntHashMap communities = new TIntIntHashMap();
    while(ns.hasNext()){
      CommunityNode n = ns.next();
      int r = uf.find(n);
      if(!communities.contains(r)){
        communities.put(r, community++);
      }
      n.setCommunity(communities.get(r));
    }
    System.exit(0);
    return this.Q;
  }

  private void init(Graph graph) {
    double M = 0.5/graph.getEdgesList().size();
    Iterator<Node> ns = graph.getNodes();
    while(ns.hasNext()){
      Node n = ns.next();
      uf.add(n);
      int edges = n.getEdgesList().size();
      if(edges == 0){
        continue;
      }
      CommunityData dat = new CommunityData(M * edges, edges);
      communityData.put(n.getId(), dat);
      Iterator<Edge> es = n.getConnections();
      while(es.hasNext()){
        Edge e = es.next();
        Node dest = e.getStart() == n ? e.getEnd() : e.getStart();
        double dstMod = 2 * M * (1.0 - edges * dest.getEdgesList().size() * M);//(1 / (2 * M)) - ((n.getEdgesList().size() * dest.getEdgesList().size()) / ((2 * M) * (2 * M)));// * (1.0 - edges * dest.getEdgesList().size() * M);
        dat.addQ(dest.getId(), dstMod);
      }
      Q += -1.0 * (edges*M) * (edges*M);
      if(n.getId() < dat.getMxQNId()){
        addToHeap(createEdge(dat.getMxQ(), n.getId(), dat.getMxQNId()));
      }
    }
  }
  void addToHeap(DoubleIntInt o){
    heap.add(o);
  }

  DoubleIntInt createEdge(double val1, int val2, int val3){
    DoubleIntInt n = new DoubleIntInt(val1, val2, val3);
    if(set.containsKey(n)){
      DoubleIntInt n1 = set.get(n);
      heap.remove(n1);
      if(n1.val1 < val1){
        n1.val1 = val1;
      }
      n = n1;
    }
    else{
      set.put(n, n);
    }
    return n;
  }
  void removeFromHeap(Collection<DoubleIntInt> col, DoubleIntInt o){
    //set.remove(o);
    col.remove(o);
  }
  DoubleIntInt findMxQEdge() {
    while (true) {
      if (heap.isEmpty()) {
        break; 
      }

      DoubleIntInt topQ = heap.first();
      removeFromHeap(heap, topQ);
      //heap.remove(topQ);
      if (!communityData.containsKey(topQ.val2) || ! communityData.containsKey(topQ.val3)) {
        continue; 
      }
      if (topQ.val1 != communityData.get(topQ.val2).getMxQ() && topQ.val1 != communityData.get(topQ.val3).getMxQ()) { 
        continue; 
      }
      return topQ;
    }
    return new DoubleIntInt(-1.0, -1, -1);
  }
  boolean mergeBestQ(Graph graph) {
    DoubleIntInt topQ = findMxQEdge();
    if (topQ.val1 <= 0.0) { 
      return false; 
    }
    // joint communities
    int i = topQ.val3;
    int j = topQ.val2;
    uf.union(i, j);

    Q += topQ.val1;
    CommunityData datJ = communityData.get(j);
    CommunityData datI = communityData.get(i);
    datI.delLink(j);
    datJ.delLink(i);

    int[] datJData = datJ.nodeToQ.keys();
    for(int _k = 0; _k < datJData.length; _k++){
      int k = datJData[_k];
      CommunityData datK = communityData.get(k);
      double newQ = datJ.nodeToQ.get(k);
      //if(datJ.nodeToQ.containsKey(i)){
      //  newQ = datJ.nodeToQ.get(i);
      //}
      if (datI.nodeToQ.containsKey(k)) { 
        newQ = newQ + datI.nodeToQ.get(k);
        datK.delLink(i);
      }     // K connected to I and J
      else { 
        newQ = newQ - 2 * datI.DegFrac * datK.DegFrac;
      }  // K connected to J not I
      datJ.addQ(k, newQ);
      datK.addQ(j, newQ);
      addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
    }

    int[] datIData = datI.nodeToQ.keys();
    for(int _k = 0; _k < datIData.length; _k++){
      int k = datIData[_k];
      if (!datJ.nodeToQ.containsKey(k)) { // K connected to I not J
        CommunityData datK = communityData.get(k);
        double newQ = datI.nodeToQ.get(k) - 2 * datJ.DegFrac * datK.DegFrac; 
        datJ.addQ(k, newQ);
        datK.delLink(i);
        datK.addQ(j, newQ);
        addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
      }
    } 
    datJ.DegFrac += datI.DegFrac; 
    if (datJ.nodeToQ.isEmpty()) { 
      communityData.remove(j); 
    } // isolated community (done)
    communityData.remove(i);
    return true;
  }
}

UPDATE:the currently listed code is fairly quick, and has half the memory usage compared to the "quickest" solution, while only being ~5% slower. the difference is in the use of hashmap + treest vs priority queue, and ensuring only a single object for a given i, j pair exists at any time.

like image 250
Zack Newsham Avatar asked Dec 07 '13 15:12

Zack Newsham


1 Answers

So here's the original paper, a neat lil' six pages, only two of which are about the design & implementation. Here's a cliffnotes:

  • For a partition of a given graph, the authors define the modularity, Q, of the partition to be the ratio of the number of edges within each community to the number of edges between each community, minus the ratio you'd expect from a completely random partition.
  • So it's effectively "how much better is this partition at defining communities than a completely random one?"
  • Given two communities i and j of a partition, they then define deltaQ_ij to be how much the modularity of the partition would change if communities i and j were merged. So if deltaQ_ij > 0, merging i and j will improve the modularity of the partition.
  • Which leads to a simple greedy algorithm: start with every node in its own community. Calculate deltaQ_ij for every pair of communities. Whichever two communities i, j have the largest deltaQ_ij, merge those two. Repeat.
  • You'll get maximum modularity when the deltaQ_ij all turn negative, but in the paper the authors let the algorithm run until there's only one community left.

That's pretty much it for understanding the algorithm. The details are in how to compute deltaQ_ij quickly and store the information efficiently.

Edit: Data structure time!

So first off, I think the implementation you're referencing does things a different way to the paper. I'm not quite sure how, because the code is impenetrable, but it seems to use union-find and hashsets in place of the author's binary trees and multiple heaps. Not a clue why they do it a different way. You might want to email the guy who wrote it and ask.

Anyway, the algorithm in the paper needs several things from the format deltaQ is stored in:

  • First, it needs to be able to recover the largest value in dQ quickly.
  • Second, it needs to be able to remove all deltaQ_ik and deltaQ_ki for a fixed i quickly.
  • Third, it needs to be able to update all deltaQ_kj and deltaQ_jk for a fixed j quickly.

The solution the authors come up to for this is as follows:

  • For each community i, each non-zero deltaQ_ik is stored in a balanced binary tree, indexed by k (so elements can be found easily), and in a heap (so the maximum for that community can be found easily).
  • The maximum deltaQ_ik from each community i's heap is then stored in another heap, so that the overall maximums can be found easily.

When community i is merged with community j, several things happen to the binary trees:

  • First, each element from the ith community is added to the jth community's binary tree. If an element with the same index k already exists, you sum the old and new values.
  • Second, we update all the remaining "old" values in the jth community's binary tree to reflect the fact that the jth community has just increased in size.
  • And for each other community's binary tree k, we update any deltaQ_kj.
  • Finally, the tree for community i is thrown away.

And similarly, several things must happen to the heaps:

  • First, the heap for community i is thrown away.
  • Then the heap for community j is rebuilt from scratch using the elements from the community's balanced binary tree.
  • And for each other community k's heap, the position of entry deltaQ_kj is updated.
  • Finally, the entry for community i in the overall heap is thrown away (causing bubbling) and the entries for community j and each community k connected to i or j are updated.

Strangely, when two communities are merged there's no reference in the paper as to removing deltaQ_ki values from the kth community's heap or tree. I think this might be dealt with by the setting of a_i = 0, but I don't understand the algorithm well enough to be sure.

Edit: Trying to decipher the implementation you linked. Their primary datastructures are

  • CmtyIdUF, a union-find structure that keeps track of which nodes are in which community (something that's neglected in the paper, but seems necessary unless you want to reconstruct community membership from a trace of the merge or something),
  • MxQHeap, a heap to keep track of which deltaQ_ij is largest overall. Strangely, when they update the value of a TFltIntIntTr in the heap, they don't ask the heap to re-heapify itself. This is worrying. Does it do it automatically or something?
  • CmtyQH, a hashmap that maps a community ID i to a structure TCmtyDat which holds what looks heap of the deltaQ_ik for that community. I think. Strangely though, the UpdateMaxQ of the TCmtyDat structure takes linear time, obviating any need for a heap. What's more, the UpdateMaxQ method only appears to be called when an element of the heap is deleted. It should definitely also be getting called when the value of any element in the heap is updated.
like image 92
Andy Jones Avatar answered Nov 11 '22 21:11

Andy Jones