Disjoint sets on Apache Spark

Question

I am trying to find an algorithm for computing disjoint sets (connected components / union-find) over a large amount of data with Apache Spark. The problem is the data volume: even the raw list of graph vertices does not fit into RAM on a single machine, and neither do the edges.

The source data is a text file of graph edges on HDFS, one edge per line: "id1 id2".

The ids are string values, not integers.

The naive solution I came up with is:

1. Take the RDD of edges -> [id1:id2] [id3:id4] [id1:id3]
2. Group the edges by key -> [id1:[id2;id3]] [id3:[id4]]
3. For each record, assign the minimum id to every member of its group (flatMap) -> [id1:id1] [id2:id1] [id3:id1] [id3:id3] [id4:id3]
4. Reverse the RDD from step 3: [id2:id1] -> [id1:id2]
5. leftOuterJoin the RDDs from steps 3 and 4
6. Repeat from step 2 until the size of the RDD from step 3 stops changing
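The idea behind these steps (spread the smallest id through each component until nothing changes) can be sketched with plain RDD operations. This is a simplified variant, not the exact pipeline described above: `minLabelComponents` and `maxIterations` are hypothetical names, and the edges are assumed to be already parsed into `(String, String)` pairs.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: labels every vertex with the smallest id in its component.
def minLabelComponents(edges: RDD[(String, String)],
                       maxIterations: Int = 50): RDD[(String, String)] = {
  // Store each edge in both directions so labels can travel either way.
  val neighbours = edges.flatMap { case (a, b) => Seq((a, b), (b, a)) }.distinct().cache()
  // Every vertex starts out labelled with its own id.
  var labels: RDD[(String, String)] = neighbours.keys.distinct().map(v => (v, v))
  var changed = true
  var iteration = 0
  while (changed && iteration < maxIterations) {
    // Send each vertex's current label to its neighbours and keep the smallest one received.
    val proposed = neighbours.join(labels)
      .map { case (_, (neighbour, label)) => (neighbour, label) }
      .reduceByKey((x, y) => if (x < y) x else y)
    // Adopt the proposed label only if it is smaller than the current one.
    val updated = labels.leftOuterJoin(proposed).mapValues { case (old, candidate) =>
      val best = candidate.filter(_ < old).getOrElse(old)
      (best, best != old)
    }.cache()
    // Stop once no vertex changed its label in this round.
    changed = !updated.filter { case (_, (_, didChange)) => didChange }.isEmpty()
    labels = updated.mapValues { case (best, _) => best }
    iteration += 1
  }
  labels
}
```

Labels move one hop per round, so the number of rounds grows with the graph diameter, and every round shuffles the full edge set through the join.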

But this causes a large amount of data to be transferred between nodes (shuffling).

Any advice?

Marsellus Wallace · Accepted Answer

If you are working with graphs, I would suggest taking a look at one of these libraries:

GraphX
GraphFrames

They both provide the connected components algorithm out of the box.

GraphX:

val graph: Graph[VD, ED] = ...  // VD, ED: your vertex and edge attribute types
val cc = graph.connectedComponents().vertices
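Since the ids in the question are strings and GraphX vertex ids must be `Long`, a mapping step is needed before building the graph. A minimal sketch, assuming the edges are already parsed into `(String, String)` pairs; `componentsWithGraphX` is a hypothetical helper name:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Hypothetical helper: returns (original string id, component id) pairs.
def componentsWithGraphX(edges: RDD[(String, String)]): RDD[(String, Long)] = {
  // Assign a unique Long to every distinct string id.
  val ids: RDD[(String, Long)] =
    edges.flatMap { case (a, b) => Seq(a, b) }.distinct().zipWithUniqueId().cache()
  // Translate both endpoints of every edge to Long ids with two joins.
  val longEdges = edges.join(ids)                      // (src, (dst, srcId))
    .map { case (_, (dst, srcId)) => (dst, srcId) }
    .join(ids)                                         // (dst, (srcId, dstId))
    .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) }
  // Vertex and edge attributes are unused here, so dummy values are enough.
  val graph = Graph.fromEdges(longEdges, defaultValue = 0)
  // Each vertex ends up labelled with the smallest Long id in its component.
  val cc = graph.connectedComponents().vertices
  // Map the Long vertex ids back to the original strings.
  ids.map(_.swap).join(cc).map { case (_, (name, component)) => (name, component) }
}
```

The component id is one of the generated `Long` values, not one of the original strings; vertices that share it belong to the same disjoint set.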

GraphFrames:

val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
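For the string ids from the question, GraphFrames can be used without a separate id-mapping step, because the vertex "id" column may hold strings. A minimal sketch, assuming an edge DataFrame with "src" and "dst" columns; `componentsWithGraphFrames` and the checkpoint path are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

// Hypothetical helper: returns a DataFrame with "id" and "component" columns.
def componentsWithGraphFrames(spark: SparkSession, edges: DataFrame): DataFrame = {
  // The default connected-components algorithm checkpoints intermediate
  // results, so a checkpoint directory has to be set first.
  spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints") // hypothetical path
  // Vertices are inferred from the "src" and "dst" columns of the edges.
  val graph = GraphFrame.fromEdges(edges)
  graph.connectedComponents.run()
}
```

A space-separated edge file like the one described in the question could be loaded with something like `spark.read.option("sep", " ").csv(path).toDF("src", "dst")`, where `path` stands in for the real HDFS location.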

Tags: algorithm, graph-theory, apache-spark, mapreduce, disjoint-sets

Puh