I have a bipartite graph and I'm looking for the most efficient iterative way to divide it into connected components. My recursive version has started overflowing the stack on large data sets. I'm willing to port from any language/pseudocode but for completeness I'll be coding in C#.
My existing code is specialized for my data types. One partition is proteins, the other is spectra. Map and Set are C++ stdlib workalikes.
void recursivelyAssignProteinToCluster (long proteinId,
long clusterId,
Set<long> spectrumSet,
Map<long, Set<long>> spectrumSetByProteinId,
Map<long, Set<long>> proteinSetBySpectrumId,
Map<long, long> clusterByProteinId)
{
// try to assign the protein to the current cluster
var insertResult = clusterByProteinId.Insert(proteinId, clusterId);
if (!insertResult.WasInserted)
return;
// recursively add all "cousin" proteins to the current cluster
foreach (long spectrumId in spectrumSet)
foreach (var cousinProteinId in proteinSetBySpectrumId[spectrumId])
{
if (proteinId != cousinProteinId)
{
Set<long> cousinSpectrumSet = spectrumSetByProteinId[cousinProteinId];
recursivelyAssignProteinToCluster(cousinProteinId,
clusterId,
cousinSpectrumSet,
spectrumSetByProteinId,
proteinSetBySpectrumId,
clusterByProteinId);
}
}
}
Map<long, long> calculateProteinClusters (NHibernate.ISession session)
{
var spectrumSetByProteinId = new Map<long, Set<long>>();
var proteinSetBySpectrumId = new Map<long, Set<long>>();
var query = session.CreateQuery("SELECT pi.Protein.id, psm.Spectrum.id " + GetFilteredQueryString(FromProtein, ProteinToPeptideSpectrumMatch));
foreach (var queryRow in query.List<object[]>())
{
long proteinId = (long) queryRow[0];
long spectrumId = (long) queryRow[1];
spectrumSetByProteinId[proteinId].Add(spectrumId);
proteinSetBySpectrumId[spectrumId].Add(proteinId);
}
var clusterByProteinId = new Map<long, long>();
int clusterId = 0;
foreach (var pair in spectrumSetByProteinId)
{
long proteinId = pair.Key;
// for each protein without a cluster assignment, make a new cluster
if (!clusterByProteinId.Contains(proteinId))
{
++clusterId;
recursivelyAssignProteinToCluster(proteinId,
clusterId,
pair.Value,
spectrumSetByProteinId,
proteinSetBySpectrumId,
clusterByProteinId);
}
}
return clusterByProteinId;
}
As ShinTakezou suggested I refactored to put the stack on the heap and it works great. I used the DepthFirstSearch approach from digEmAll's example.
var clusterByProteinId = new Map<long, long>();
int clusterId = 0;
var clusterStack = new Stack<KeyValuePair<long, Set<long>>>();
foreach (var pair in spectrumSetByProteinId)
{
long proteinId = pair.Key;
if (clusterByProteinId.Contains(proteinId))
continue;
// for each protein without a cluster assignment, make a new cluster
++clusterId;
clusterStack.Push(new KeyValuePair<long, Set<long>>(proteinId, spectrumSetByProteinId[proteinId]));
while (clusterStack.Count > 0)
{
var kvp = clusterStack.Pop();
// try to assign the protein to the current cluster
var insertResult = clusterByProteinId.Insert(kvp.Key, clusterId);
if (!insertResult.WasInserted)
continue;
// add all "cousin" proteins to the current cluster
foreach (long spectrumId in kvp.Value)
foreach (var cousinProteinId in proteinSetBySpectrumId[spectrumId])
if (!clusterByProteinId.Contains(cousinProteinId))
clusterStack.Push(new KeyValuePair<long, Set<long>>(cousinProteinId, spectrumSetByProteinId[cousinProteinId]));
}
}
Connected components are the set of its connected subgraphs. Two nodes belong to the same connected component when there exists a path (without considering the direction of the edges) between them. Therefore, the algorithm does not consider the direction of edges.
It is based on the idea that if one is able to reach a vertex v starting from vertex u, then one should be able to reach vertex u starting from vertex v and if such is the case, one can say that vertices u and v are strongly connected - they are in a strongly connected sub-graph.
Breadth-first search (BFS) is used for finding all connected components in a graph (finding all nodes within one connected component). The set of nodes reached by a BFS are the largest connected component containing the start node.
Connected component labeling works by scanning an image, pixel-by-pixel (from top to bottom and left to right) in order to identify connected pixel regions, i.e. regions of adjacent pixels which share the same set of intensity values V.
Here's an example of an helper class that holds an undirected Graph and allows to get the connected components of it (iteratively):
public class Graph<T>
{
public Dictionary<T, HashSet<T>> nodesNeighbors;
public IEnumerable<T> Nodes
{
get { return nodesNeighbors.Keys; }
}
public Graph()
{
this.nodesNeighbors = new Dictionary<T, HashSet<T>>();
}
public void AddNode(T node)
{
this.nodesNeighbors.Add(node, new HashSet<T>());
}
public void AddNodes(IEnumerable<T> nodes)
{
foreach (var n in nodes)
this.AddNode(n);
}
public void AddArc(T from, T to)
{
this.nodesNeighbors[from].Add(to);
this.nodesNeighbors[to].Add(from);
}
public bool ContainsNode(T node)
{
return this.nodesNeighbors.ContainsKey(node);
}
public IEnumerable<T> GetNeighbors(T node)
{
return nodesNeighbors[node];
}
public IEnumerable<T> DepthFirstSearch(T nodeStart)
{
var stack = new Stack<T>();
var visitedNodes = new HashSet<T>();
stack.Push(nodeStart);
while (stack.Count > 0)
{
var curr = stack.Pop();
if (!visitedNodes.Contains(curr))
{
visitedNodes.Add(curr);
yield return curr;
foreach (var next in this.GetNeighbors(curr))
{
if (!visitedNodes.Contains(next))
stack.Push(next);
}
}
}
}
public Graph<T> GetSubGraph(IEnumerable<T> nodes)
{
Graph<T> g = new Graph<T>();
g.AddNodes(nodes);
foreach (var n in g.Nodes.ToList())
{
foreach (var neigh in this.GetNeighbors(n))
g.AddArc(n, neigh);
}
return g;
}
public IEnumerable<Graph<T>> GetConnectedComponents()
{
var visitedNodes = new HashSet<T>();
var components = new List<Graph<T>>();
foreach (var node in this.Nodes)
{
if (!visitedNodes.Contains(node))
{
var subGraph = GetSubGraph(this.DepthFirstSearch(node));
components.Add(subGraph);
visitedNodes.UnionWith(subGraph.Nodes);
}
}
return components;
}
}
Usage:
static void Main(string[] args)
{
var g = new Graph<long>();
g.AddNodes(new long[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 });
g.AddArc(1, 2);
g.AddArc(1, 3);
g.AddArc(9, 6);
g.AddArc(6, 7);
g.AddArc(6, 8);
g.AddArc(4, 5);
var subGraphs = g.GetConnectedComponents();
}
You could use the Graph<>
class instead of your maps, or if you want to stick with your maps have a look at the code that is quite easy to understand (inside the class it is used a Dictionary<T,HashSet<T>>
to hold nodes and arcs, so is very similar to your approach)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With