Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to identify "fuzzily-connected" subgraphs

I have a problem which looks like the connectet subgraphs problem from a mile-high, but is quite distinct in that it does not fall under the strict definitions.

I face a graph with a few millions of nodes and links (manual analysis is not possible), among those millions of nodes, there are known to be 2 or 3 "sets".

Each of the "sets" is comprised of hundreds of thousandths of nodes, and tens of thousandths sub-graphs, not strongly connected. Each of those sets should theoretically not be linked to the other sets... but there are (guesstimation) a dozen of erroneous links that end up connecting those sets.

The problem is to find those sets and the erroneous links, or at least get a human-manageable list of erroneous links candidates that can be verified manually.

My current "best idea" is to randomly pick two nodes, find the shortest path between them, then mark the links on that shortest path. Rinse & repeat millions of times, and the erroneous links eventually end up as the most marked ones, as they are "chokepoints" between the sets.

However, this is quite slow, and when one set is much larger than the others and has internal chokepoints, it ends up dominating the "most marked" list, making it meaningless.

Are there better algorithms/approaches for that?

edit: a refinement of the path marking is to mark proportionally with the length of the path, which helps with the "internal chokepoints of a large set" issue, but does not entirely eliminates it as some sets can have distant "outliers", while other sets have lots of tightly connected nodes (short internal distances)

like image 664
Eric Grange Avatar asked Nov 09 '22 21:11

Eric Grange


1 Answers

My idea is an ant colony algorithm. I get inspired with your approach of choosing two random nodes, but thought it will be useful to do something more instead of just computing the shortest path.

Start n ants in n random nodes. You will need to adjust n with trial and error method. Ants leave a pheromone on traveled edges. Pheromone evaporate in time. Ants choose one of the distinct edges to travel according to the probability. The more pheromone an edge has, the more likely an ant will choose that edge.

In the beginning ants move totally randomly, since there is no pheromone and edges have the same probability. However, over time, the most popular edges, bridges between two "fuzzily-connected" components will have more and more pheromone on them.

So, you throw n ants, simulate for m turns and return edges with the highest amount of pheromone on them. You can visualize this process to clearly see what is going on.

Update: I realized that the sentence "However, over time, the most popular edges, bridges between two "fuzzily-connected" components will have more and more pheromone on them" is wrong. I implemented it and it looks like most of the time bridges not necessarily attract ants:

enter image description here

There were n = 1000 ants and m = 1000 steps. Initially every edge had 1 unit of pheromone and it was increased by 1 if ant traveled over it. No evaporation, however I think it would not improve the situation. Bridge had 49845 units of pheromone, but there were three other edges which had more than 100k.


As suggested by Peter de Rivaz in the comment, I tried (source code) repeating min-cut between 2 random nodes and it is much better:

enter image description here

Graphs generated with python-igraph library.

like image 164
Adam Stelmaszczyk Avatar answered Nov 15 '22 07:11

Adam Stelmaszczyk