I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat it as a tree), but in each iteration I only edit a portion of it, namely a sub-tree rooted at a given node, plus the nodes on the path between that node and the root. The iterations have a dependency, meaning iteration i+1 needs the result of iteration i, so I need to store the result of each iteration for the next step.
I'm trying to find an efficient way to update the RDD, but have no clue so far. I found that PairRDD has a lookup function which could reduce the computation time from O(N) to O(M), where N denotes the total number of objects in the RDD and M denotes the number of elements in each partition. So I'm wondering: is there any way to update an object in the RDD in O(M)? Or, more ideally, O(1)? (I saw an email on Spark's mailing list saying that lookup can be modified to achieve O(1).)
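To make the lookup idea concrete, here's a minimal sketch of what I mean; the (Long, String) key/value schema and the partition count are just placeholders, not my real data:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on older Spark versions)

val sc = new SparkContext(new SparkConf().setAppName("lookup-sketch").setMaster("local[*]"))

// A pair RDD keyed by node id; (Long, String) is a placeholder schema.
val nodes = sc.parallelize((1L to 1000000L).map(id => (id, s"node-$id")))

// With a known partitioner, lookup(key) only scans the single partition
// that owns the key: O(M) per call instead of O(N).
val byKey = nodes.partitionBy(new HashPartitioner(16)).cache()
println(byKey.lookup(42L)) // prints the values stored under key 42
```

Note that without a partitioner, lookup still has to scan every partition.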
Another thing: if I could achieve O(M) for updating the RDD, could I increase the number of partitions beyond the number of cores I have and achieve better performance?
As functional data structures, RDDs are immutable and an operation on an RDD generates a new RDD.
Immutability of the structure does not necessarily mean full replication. Persistent data structures are a common functional pattern where operations on immutable structures yield a new structure, but previous versions are maintained and often reused.
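A tiny plain-Scala illustration of that path-copying idea (the Node type here is made up for the example, nothing Spark-specific): updating one node copies only the nodes on the path from the root down to it, while every untouched subtree is shared between the old and new versions.

```scala
// A minimal immutable binary tree; this type is purely illustrative.
case class Node(value: Int, left: Option[Node] = None, right: Option[Node] = None)

// Increment the leftmost leaf: only the nodes on the path from the root
// to that leaf are copied; every other subtree is reused by reference.
def bumpLeftmost(n: Node): Node = n.left match {
  case Some(l) => n.copy(left = Some(bumpLeftmost(l))) // right subtree shared as-is
  case None    => n.copy(value = n.value + 1)
}

val shared = Node(3)
val v1 = Node(1, left = Some(Node(2)), right = Some(shared))
val v2 = bumpLeftmost(v1)
assert(v2.right.get eq shared) // structural sharing: same object, not a copy
```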
GraphX (a 'module' on top of Spark) is a graph API that uses exactly this concept. From the docs:
Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes. Note that substantial parts of the original graph (i.e., unaffected structure, attributes, and indices) are reused in the new graph reducing the cost of this inherently functional data-structure.
It might be a solution for the problem at hand: http://spark.apache.org/docs/1.0.0/graphx-programming-guide.html
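A minimal sketch of that pattern (the toy three-node tree and the Int vertex attributes are assumptions for illustration, not your actual data):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

// A tiny tree: vertex 1 is the root, with children 2 and 3.
val vertices = sc.parallelize(Seq[(VertexId, Int)]((1L, 0), (2L, 0), (3L, 0)))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, ()), Edge(1L, 3L, ())))
val graph    = Graph(vertices, edges)

// "Updating" vertex 2 produces a new Graph; per the docs, unaffected
// structure, attributes, and indices are reused rather than copied.
val updated = graph.mapVertices((id, attr) => if (id == 2L) attr + 1 else attr)

updated.vertices.collect().foreach(println)
sc.stop()
```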
An RDD is a distributed dataset; a partition is the unit of RDD storage, and an element is the unit of RDD processing.
For example, if you read a large file from HDFS as an RDD, then the elements of that RDD are Strings (the lines of the file), and Spark stores the RDD across the cluster by partition. For you, as a Spark user, you only need to care about how to deal with the lines of that file, just as if you were writing a normal program reading a file from the local file system line by line. That's the power of Spark :)
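A minimal sketch of that line-oriented view (the HDFS path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lines-sketch").setMaster("local[*]"))

// One element per line; Spark decides how the lines are spread across partitions.
val lines = sc.textFile("hdfs:///path/to/large-file.txt")

// You write per-element logic as if reading the file line by line.
val longLines = lines.filter(_.length > 80)
println(longLines.count())
```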
Anyway, you have no idea which elements will be stored in a certain partition, so it doesn't make sense to update a certain partition.