There is an excellent comparison of the community detection algorithms available in igraph here. However, there is some ambiguity about how the algorithms that accept weighted edges actually use those weights.
Typically, edge weights are oriented so that higher weights suggest keeping the nodes together (e.g., strength of friendship). This works nicely with modularity scores, which compare the weighted density within communities to that between them.
However, the Newman-Girvan community detection algorithm uses betweenness, which is based on distances. In this case, I would expect the edge weights to reflect the distance between nodes, so that computing shortest paths means summing the weights along the path. That is, the weight is a cost or distance score, and higher values should push nodes into different communities.
Am I correct in expecting higher weights for greater distances when using Newman-Girvan and, if so, how does this reconcile with using modularity to decide how many communities to keep?
My answer is going to be based on the igraph package in R. The situation is indeed quite confusing and the questions are relevant since, as Newman (2004) says,
Since the publication of that work, the author has been asked a number of times whether an appropriate generalization of the algorithm exists for weighted networks.
In his paper he derives an appropriate generalization of the Newman-Girvan algorithm to weighted networks.
You are right about the interpretation of weights in the Newman-Girvan algorithm. edge_betweenness uses a formula analogous to that in Brandes (2001), where the length of a path is defined as the sum of the weights of its edges. (You may also check the source code, but it's quite involved.) In ?edge_betweenness and, in particular, in ?cluster_edge_betweenness it says:
Edge weights are used to calculate weighted edge betweenness. This means that edges are interpreted as distances, not as connection strengths.
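For concreteness, here is a minimal sketch (my own toy example, with arbitrary weights) confirming the distance interpretation: on a weighted 4-cycle, the geodesic from A to B avoids the heavy direct edge.

```r
library(igraph)

## A weighted 4-cycle; weights are treated as distances:
g <- make_graph(~ A-B-C-D-A)    # edges in order: A-B, B-C, C-D, D-A
E(g)$weight <- c(10, 1, 1, 1)   # the direct A-B edge is "long"

## The A-to-B geodesic is A-D-C-B (total length 3), not A-B (length 10):
shortest_paths(g, from = "A", to = "B")$vpath
edge_betweenness(g, weights = E(g)$weight)
```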
The implications are as follows. Let b(e, w) be the edge betweenness of an edge e when its weight is w, holding all other weights fixed. Then it can be shown (I could elaborate if you wish) that
b(e, w) <= b(e, w*) whenever w >= w*.
That is, the edge betweenness of e is weakly decreasing in its weight. The intuition: if we raise the weight from w to w* >> w, the shortest paths that used to cross e are now likely to be dominated by alternative paths that avoid e. Hence, a larger weight implies (weakly) lower betweenness, and lower betweenness makes it less likely that e will be recognized as an edge connecting two communities. This is strange if we see weights as distances, because between-community edges are then supposed to carry large weights, and those are precisely the edges the algorithm would fail to flag. On the other hand, if e is within some community and we decrease its weight, the number of shortest paths through that edge potentially increases and it becomes more likely to be seen as connecting two communities. I am not yet claiming anything about the corresponding modularity scores, though.
Now let's suppose that weights actually correspond to connection strengths. Then the stronger the connection, the fewer shortest paths go through that edge (path lengths are still computed as weight sums), the lower its edge betweenness, and the less likely it is to be removed. So that kind of makes sense.
What is not nice, or rather strange, is that the length of a path is then defined as the sum of its connection strengths. However, we can reinterpret the algorithm. Suppose that the weights are >> 1 within communities and << 1 between them. Then we can interpret the length of a path as its privacy (e.g., a path within a community would consist of many close interactions, while the edge connecting two communities is somewhat public, open). Given such an interpretation, the algorithm would look for the least private / most open paths and compute the corresponding betweenness. We would then be removing edges that lie on many of the most open paths.
So perhaps I made a mistake somewhere, but it looks like it would make more sense to see weights as connection strengths.
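As a sanity check on the strengths reading, a minimal sketch (two 4-cliques joined by a single weak bridge; the strength values 5 and 1 are my own arbitrary choices):

```r
library(igraph)

## Weights as connection strengths: strong inside, weak bridge.
g <- make_full_graph(4) + make_full_graph(4)   # vertices 1-4 and 5-8
g <- add_edges(g, c(4, 5))                     # the bridge 4-5, added last
E(g)$weight <- c(rep(5, 12), 1)

## The bridge acquires the highest betweenness and is removed first,
## so the two cliques should come out as the two communities:
cluster_edge_betweenness(g)$membership
```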
Newman (2004) does something related:
...we will consider specifically those networks in which the weights on edges take greater values for vertex pairs that have closer connections or are more similar in some way.
It would seem that this should make sense. However, to keep the more natural notion of shortest paths, he writes:
One can define paths on a weighted network by assuming the “length” of an edge to vary inversely with its weight, so that two vertices that are connected twice as strongly will be half as far apart.
That is, shortest path lengths are now inversely related to the weights. But whereas using the weights directly as lengths seemed to behave sensibly above, this inversion creates a problem:
To see this, notice that any two vertices that are particularly strongly connected to one another will have a particularly short distance along the edge between them. Geodesic paths will thus, all other things being equal, prefer to flow along such an edge than along another longer edge between two less well connected vertices, and hence closely connected pairs will tend to attract a lot of paths and acquire high betweenness. This means that, as a general rule, we are more likely to remove edges between well connected pairs than we are between poorly connected pairs, and this is the precise opposite of what we would like the algorithm to do.
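To see Newman's point numerically, a toy sketch (two 4-cliques joined by two bridges, one strong and one weak; again the strength values are mine): under the inverted lengths, the strong bridge attracts all cross-community geodesics and acquires the highest betweenness, so it would be removed before the weak one.

```r
library(igraph)

## Two cliques joined by a strong bridge (strength 5) and a weak one (1):
g <- make_full_graph(4) + make_full_graph(4)
g <- add_edges(g, c(4, 5,  1, 8))      # bridges 4-5 (strong) and 1-8 (weak)
strength <- rep(5, ecount(g))
strength[ecount(g)] <- 1               # the last edge, 1-8, is weak

## Newman's convention: length = 1 / strength.
eb <- edge_betweenness(g, weights = 1 / strength)
E(g)[which.max(eb)]                    # the STRONG bridge 4-5 wins
```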
This is exactly the behaviour I described above when weights are read as distances. As I mentioned at the beginning of the answer, to deal with this Newman (2004) proposes mapping weighted graphs onto unweighted multigraphs and then proceeding much as in the standard case. I believe this multigraph idea can be implemented by setting weighted = NULL while supplying a non-binary (integer) adjacency matrix when defining the graph; see the weighted argument in ?graph_from_adjacency_matrix.
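A minimal sketch of that reading (my interpretation of the docs: integer entries become counts of parallel edges when weighted = NULL):

```r
library(igraph)

## Newman's multigraph trick: integer "strengths" as parallel edges.
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 1] <- 3   # a strength-3 tie becomes 3 parallel edges
A[2, 3] <- A[3, 2] <- 1
A[3, 4] <- A[4, 3] <- 1
g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = NULL)
ecount(g)   # 5 edges: the 1-2 tie appears three times
```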
As for the modularity part of the question: one can certainly use modularity with weighted graphs, as Newman (2004) does; that's not a problem. In general, though, it's not so obvious how using weights affects modularity as the criterion for choosing the number of communities. I'll perhaps add some examples in R. As Newman (2004) finds, there should be an improvement over the unweighted case when the interpretation of the weights is in line with how the algorithm works. Otherwise, I think the graph structure and the weights themselves matter a great deal for how far from the truth we end up.
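For completeness, weighted modularity can be evaluated directly for any candidate split; a sketch on the same two-clique toy graph as above:

```r
library(igraph)

## Weighted modularity of the hand-picked two-clique split:
g <- make_full_graph(4) + make_full_graph(4)
g <- add_edges(g, c(4, 5))
E(g)$weight <- c(rep(5, 12), 1)
modularity(g, membership = rep(1:2, each = 4), weights = E(g)$weight)
```

If I recall correctly, newer versions of igraph even warn when cluster_edge_betweenness is given weights, pointing at exactly this tension: edge betweenness treats the weights as distances, while the modularity used to pick the split treats them as similarities.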
References
Newman, M.E.J., 2004. Analysis of weighted networks. Physical Review E, 70(5), 056131.
Brandes, U., 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2), pp. 163-177.