
What is an efficient way to update a value inside Spark's RDD?

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat it as a tree), but in each iteration I only edit a portion of it, namely the sub-tree rooted at a given node plus the nodes on the path between that node and the root.

The iterations have a dependency: iteration i+1 needs the result of iteration i, so I need to store the result of each iteration for the next step.

I'm trying to find an efficient way to update the RDD, but have no clue so far. I found that PairRDD has a lookup function which could reduce the computation time from O(N) to O(M), where N is the total number of elements in the RDD and M is the number of elements in each partition.

So I'm wondering: is there any way I could update an object in the RDD in O(M)? Or, more ideally, O(1)? (I saw an email on Spark's mailing list saying that lookup can be modified to achieve O(1).)
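For context, here is a minimal sketch of the O(M) lookup path being described; the HDFS path, the tab-separated layout, and the partition count are all made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.x)

val sc = new SparkContext(new SparkConf().setAppName("lookup-sketch"))

// Key each node by its id and hash-partition the pair RDD. Once the RDD has a
// known partitioner, lookup() only scans the single partition that the key
// hashes to, O(M) for partition size M, instead of filtering every partition.
val nodes = sc.textFile("hdfs:///path/to/nodes")     // illustrative path
  .map(line => (line.split("\t")(0).toLong, line))   // (nodeId, payload)
  .partitionBy(new HashPartitioner(64))
  .cache()

val payloads: Seq[String] = nodes.lookup(42L)        // scans one partition
```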

Another thing: if I could achieve O(M) for updating the RDD, could I increase the number of partitions to something larger than the number of cores I have and get better performance?
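For what it's worth, the usual way to express an "update" over an immutable RDD is to build a new RDD. A hedged sketch, reusing the `nodes` pair RDD from above; the key set and the edit applied are stand-ins:

```scala
// Hypothetical: ids of the sub-tree plus the path to the root, computed elsewhere.
// Broadcasting keeps the closure small on a 4M-node dataset.
val keysToUpdate = sc.broadcast(Set(7L, 13L, 42L))

// Unchanged elements pass through untouched, but the map still visits every
// element; that is the O(N) cost the question is trying to avoid. (Note that
// map drops the RDD's partitioner; mapPartitions with preservesPartitioning
// = true keeps it.)
val updatedNodes = nodes.map { case (id, payload) =>
  if (keysToUpdate.value.contains(id)) (id, payload + "|updated") // stand-in edit
  else (id, payload)
}
updatedNodes.cache() // iteration i+1 reads this result
```

Because each iteration depends on the previous one, it is also worth checkpointing periodically so the lineage chain doesn't grow without bound. On the partition-count question, Spark's tuning guide suggests roughly 2-3 tasks per CPU core, so somewhat more partitions than cores is normal; very small partitions eventually cost more in scheduling overhead than they save.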

asked Jun 10 '14 by bxshi

2 Answers

As functional data structures, RDDs are immutable, and an operation on an RDD generates a new RDD.

Immutability of the structure does not necessarily mean full replication. Persistent data structures are a common functional pattern where operations on an immutable structure yield a new structure while previous versions are maintained and often partially reused.
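A tiny illustration of that structural sharing in plain Scala:

```scala
// Prepending to an immutable List allocates one new cons cell; the three
// existing cells are shared, not copied.
val tail     = List(2, 3, 4)
val extended = 1 :: tail // List(1, 2, 3, 4)
// `tail` is unchanged and still backs the suffix of `extended`.
```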

GraphX (a 'module' on top of Spark) is a graph API that uses this concept. From the docs:

Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes. Note that substantial parts of the original graph (i.e., unaffected structure, attributes, and indices) are reused in the new graph reducing the cost of this inherently functional data-structure.

It might be a solution for the problem at hand: http://spark.apache.org/docs/1.0.0/graphx-programming-guide.html
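A minimal sketch of that pattern in GraphX; the toy graph, the Double vertex attribute, and the set of vertex ids to touch are all made up (assumes a SparkContext `sc`, e.g. from spark-shell):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Toy tree: vertex attribute is a Double score (illustrative).
val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 1.0), (3L, 1.0)))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "child"), Edge(1L, 3L, "child")))
val graph    = Graph(vertices, edges)

// Hypothetical: the sub-tree (plus path to root) touched in this iteration.
val touched = Set(2L)

// mapVertices returns a new Graph; the edge structure and the indices of
// unaffected parts are reused rather than copied.
val next = graph.mapVertices { (id, score) =>
  if (touched.contains(id)) score * 2.0 else score
}
```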

answered by maasg


An RDD is a distributed data set; a partition is the unit of RDD storage, and an element is the unit of processing.

For example, if you read a large file from HDFS as an RDD, then the elements of this RDD are Strings (the lines of that file), and Spark stores this RDD across the cluster in partitions. As a Spark user, you only need to care about how to process those lines, just as if you were writing a normal program reading a file from the local file system line by line. That's the power of Spark :)
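As a rough illustration (in spark-shell, where `sc` is predefined; the path is made up):

```scala
// Each element of `lines` is one line of the file, a String; Spark decides
// how those elements are laid out across partitions on the cluster.
val lines = sc.textFile("hdfs:///path/to/large-file.txt")

// You write per-element logic as if reading the file line by line locally:
val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // Spark distributes the shuffle across the cluster
```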

Anyway, you have no idea which elements will be stored in a given partition, so it doesn't make sense to update a specific partition directly.

answered by cloud