Distributed local clustering coefficient algorithm (MapReduce/Hadoop)

I have implemented a MapReduce-based local clustering coefficient algorithm. However, I have run into serious trouble with bigger datasets or with specific datasets (high average node degree). I tried tuning my Hadoop platform and the code, but the results were unsatisfactory (to say the least). Now I have turned my attention to actually changing/improving the algorithm. Below is my current algorithm (pseudo code):

foreach(Node in Graph) {
  //Job1
  /* Transform edge-based input dataset to node-based dataset */

  //Job2
  map() {
    foreach(neighbour in this.Node.neighbours)
      emit(neighbour, this.Node) //send my data to every one of my neighbours
    emit(this.Node, this.Node)   //send my data to myself
  }

  reduce() {
    NodeNeighbourhood nodeNeighbourhood = new NodeNeighbourhood();
    while(values.hasNext()) {
      value = values.next()
      if(value is the central node itself)
        nodeNeighbourhood.setCentralNode(value) //store my own data
      else
        nodeNeighbourhood.addNeighbour(value)   //store a neighbour's data
    }

    emit(null, nodeNeighbourhood) //the complete neighbourhood of one node
  }

  //Job3
  map() {
    float lcc = calculateLocalCC(this.nodeNeighbourhood)
    emit(0, lcc) //emit every lcc under a single key; combiners are used
  }

  reduce() {
    float combinedLCC = 0;
    int numberOfNodes; //total node count (obtained elsewhere, e.g. from a Job1 counter; values cannot simply be counted here because combiners pre-sum them)
    while(values.hasNext()) {
      combinedLCC += values.next();
    }

    emit(null, combinedLCC/numberOfNodes); //store the graph's average local clustering coefficient
  }
}

A little more detail about the code: for directed graphs the neighbour data is restricted to the node ID and the destination IDs of its OUT edges (to decrease the data size); for undirected graphs it is likewise the node ID and the destination IDs of its edges. The sort and merge buffers are increased to 1.5 GB, and the number of merge streams to 80.
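
To make the tuning concrete, this is roughly how I set it from the job driver (a minimal sketch assuming the old, pre-YARN Hadoop API; io.sort.mb and io.sort.factor are the Hadoop 1.x names for the map-side sort buffer size and the merge fan-in, and the class/method names are purely illustrative):

import org.apache.hadoop.mapred.JobConf;

public class Job2Tuning {
  //apply the buffer settings described above to a job configuration
  public static JobConf tune(JobConf conf) {
    conf.setInt("io.sort.mb", 1536);   //1.5 GB map-output sort buffer
    conf.setInt("io.sort.factor", 80); //merge up to 80 spill streams in one pass
    return conf;
  }
}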

It is clear that Job2 is the actual problem of the whole algorithm. It generates a massive amount of data that has to be sorted/copied/merged, which essentially kills the algorithm's performance for certain datasets. Can someone guide me on how to improve the algorithm? I was thinking about making Job2 iterative ("process" only M nodes out of the N in each iteration until every node is "processed"), but I have abandoned this idea for now. In my opinion the Job2 map output should be decreased, to avoid the costly sort/merge phases, which kill the performance.
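
For completeness, this is roughly how I imagined the abandoned iterative Job2 (my own sketch in the same pseudo code style; P is the number of passes, and the modulo filter on the destination node is just one way to split the work so that only about N/P neighbourhoods are assembled and shuffled per pass):

//Job2, pass p out of P (hypothetical iterative variant, never implemented)
map() {
  foreach(neighbour in this.Node.neighbours)
    if(neighbour.id % P == p)
      emit(neighbour, this.Node) //only neighbourhoods in partition p are assembled this pass
  if(this.Node.id % P == p)
    emit(this.Node, this.Node)   //emit myself to myself, only in my own pass
}
//Job3 would then run once per pass and the per-pass sums would be combined at the end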

I have also implemented the same algorithm (3 jobs as well, the same "communication" pattern, and the same "Job2" problem) for the Giraph platform. However, Giraph is an in-memory platform, and for the same "problematic" datasets the algorithm results in an OutOfMemoryException.

I will be grateful for any comment, remark or guideline.


UPDATE

I'm going to change the algorithm drastically. I've found this article: Counting Triangles.

Once the code is implemented I'll post my findings here, along with more detailed code (if this approach turns out to be successful).


UPDATE_2

In the end I ended up modifying the NodeIterator++ algorithm to my needs (the Yahoo paper is available through a link in the article). Unfortunately, although I can see an improvement in performance, the end result is not as good as I had hoped. The conclusion I have reached is that the cluster available to me is simply too small to make the LCC calculation feasible for these specific datasets. So the question remains, or rather evolves: does anyone know of an efficient distributed/sequential algorithm for calculating the LCC or triangles with limited resources? (By no means am I stating that the NodeIterator++ algorithm is bad; I am simply stating that the resources available to me are not sufficient.)
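
For anyone landing on this question, this is the rough structure of NodeIterator++ as I read it in the Suri/Vassilvitskii paper (my paraphrase in the same pseudo code style, not my modified version; "<" is the paper's total order on nodes, lower degree first, ties broken by ID):

//Round 1: pivot every open triad on its lowest-ordered node
map(edge (u,v)) {
  if(u < v) emit(u, v) else emit(v, u)
}
reduce(v, higherNeighbours) {
  foreach(pair {a,b} of distinct nodes in higherNeighbours)
    emit((a,b), v) //v closes a potential triangle a-v-b
}

//Round 2: keep only the triads whose third edge really exists
map() {
  emit((u,w), "$") for every edge (u,w) of the graph //edge marker
  emit((a,b), v)   for every record from Round 1     //pass through
}
reduce((a,b), values) {
  if(values contain "$")
    triangles += number of pivot nodes v among values
}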

1 Answer

In the paper "MapReduce in MPI for large scale graph algorithms" the authors give a nice description of a MapReduce implementation of triangle counting. The paper is available here: http://www.sciencedirect.com/science/article/pii/S0167819111000172 but you may need an account to access it. (I'm on a university system that has paid for the subscription, so I never know which papers I can only access because the subscription has already been paid for.) The authors may have a draft posted on their personal website(s).

There is another way you could count triangles, though it is probably much less efficient unless your graph is fairly dense. First, construct the adjacency matrix A of your graph. Then compute A^3 (you could do the matrix multiplication in parallel pretty easily). Then sum up the (i,i) entries of A^3 and divide by 6. That gives you the number of triangles, because the (i,j) entry of A^k counts the number of walks of length k from i to j, and any walk that starts at i and ends at i after 3 steps is a triangle; each triangle is counted 6 times (3 choices of starting vertex times 2 directions around the triangle). This is mainly less efficient because, if your graph is sparse, the matrix will be very large compared to an edge list.
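
Here is a tiny sequential illustration of that idea, just to make the arithmetic concrete (not a distributed implementation; the graph is a hard-coded toy example):

public class TriangleTrace {

    //naive n^3 matrix multiplication, fine for a toy example
    static long[][] multiply(long[][] x, long[][] y) {
        int n = x.length;
        long[][] z = new long[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    z[i][j] += x[i][k] * y[k][j];
        return z;
    }

    public static void main(String[] args) {
        //adjacency matrix of a toy graph: triangle 0-1-2 plus the pendant edge 2-3
        long[][] a = {
            {0, 1, 1, 0},
            {1, 0, 1, 0},
            {1, 1, 0, 1},
            {0, 0, 1, 0}
        };

        long[][] a3 = multiply(multiply(a, a), a);

        long trace = 0;
        for (int i = 0; i < a3.length; i++)
            trace += a3[i][i];

        System.out.println("triangles = " + trace / 6); //prints 1
    }
}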
