Semi clustering algorithm is mentioned in the Google Pregel paper. The score of a semi cluster is calculated using the below formula
where
Ic is sum of the weights of all the internal edges
Bc is the sum of the weights of all the boundary edges
Vc is the number of vertices in the semi cluster and
fb is the boundary edge score factor (user defined between 0 and 1)
The algorithm was pretty straight forward, but I could not understand how the above formula was arrived. Note that the denominator is the number of edges possible between Vc number of vertices.
Could someone please explain it?
The score makes sense if you think about the quantity that it is meant to capture.
The problem being addressed here is figuring out what's the best way to place vertices of a graph into semi-clusters (simply a group of vertices where each vertex can be in more than one semi-cluster) with some upper bound on the total number of semi-cluster. So one method of finding the "best" way is to assign a score to any potential semi-cluster (in other words, to any arbitrary group of vertices). Then the problem becomes that of maximizing the total score.
So, semi-cluster are meant to capture cliques in a graph. For example in a social graph, a semi-cluster might be the members of a high school basketball team.
Thus more interior edges equates to a "better" semi-cluster. This explains the I_c
in the numerator. Similarly, you want to have very few boundary edges since if there are a lot of boundary edges, then that means there is likely to be a better semi-group containing the one you are examining. This gives the -f_b * B_c
in the numerator. f_b
is simply a scaling factor so that you can tune how much penalty you want to assign the boundary edges.
The denominator is also kind of a scaling factor. It is used to normalize the semi-cluster scores so that small clusters don't get completely dominated by larger ones. An extreme example of this is if you consider the semi-group of everyone in the world. Clearly there are no boundary edges and tons of interior edges, but it is undoubtably a less useful semi-group than the high school basketball team.
It is related to cliques.
V_c * (V_c - 1) is the number of edges in a clique of size V_c.
So if you take the sum over all edges in the group I_c, this is the appropriate normalization to get an arithmetic mean.
I.e. I_c / (V_c * (V_c - 1)) is the mean weight inside the clique.
Now the - f_B * B_c term is a penalty for outgoing edges. IMHO, it should only be divided by V_c, but that is personal taste, as I'd assume the expected outgoing edges to scale with the number of clique members, not with the square of this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With