I have one question that came up while building a Spark app.
In the Spark API, what is the difference between the makeRDD function and the parallelize function?
There is no difference whatsoever. To quote the makeRDD docstring:
"This method is identical to parallelize."
And if you take a look at the implementation, it simply calls parallelize:
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}
At the end of the day it is a matter of taste. One thing to consider is that makeRDD seems to be specific to the Scala API. PySpark and the internal SparkR API provide only parallelize.
Note: There is a second implementation of makeRDD which allows you to set location preferences, but given a different signature it is not interchangeable with parallelize.
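To see the equivalence in practice, here is a minimal sketch that builds the same collection both ways and compares the result. The local master, the app name, and the object name MakeRddVsParallelize are assumptions made for this illustration and are not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MakeRddVsParallelize {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders chosen for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("makeRDD-vs-parallelize")
    val sc = new SparkContext(conf)
    try {
      val data = Seq(1, 2, 3, 4, 5, 6)

      // Same input and same number of slices through both entry points.
      val viaParallelize = sc.parallelize(data, numSlices = 3)
      val viaMakeRDD = sc.makeRDD(data, numSlices = 3)

      // glom() groups the elements of each partition into one array,
      // so this compares the partition layout as well as the contents.
      val left = viaParallelize.glom().collect().map(_.toSeq).toSeq
      val right = viaMakeRDD.glom().collect().map(_.toSeq).toSeq

      assert(viaParallelize.getNumPartitions == viaMakeRDD.getNumPartitions)
      assert(left == right)
    } finally {
      sc.stop()
    }
  }
}
```

Because the two-argument makeRDD only forwards to parallelize, both RDDs should end up with the same partitioning and the same elements.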
As noted by @zero323, makeRDD has 2 implementations. One is identical to parallelize. The other is a very useful way to inject data locality into your Spark application even if you are not using HDFS.
For example, it provides data locality when your data is already distributed on disk across your Spark cluster according to some business logic. Assume your goal is to create an RDD that will load data from disk and transform it with a function, and you would like to do so while running local to the data as much as possible.
To do this, you can use makeRDD to create an empty RDD with different location preferences assigned to each of your RDD partitions. Each partition can be responsible for loading your data. As long as you fill the partitions with the path to your partition-local data, then execution of subsequent transformations will be node-local.
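// rddElemList is a java.util.List of (key, preferred hosts) pairs, one pair per partition.
Seq<Tuple2<Integer, Seq<String>>> rddElemSeq =
    JavaConversions.asScalaBuffer(rddElemList).toSeq();
// ct is the ClassTag<Integer> the Scala API expects; sparkContext is presumably a Scala SparkContext.
RDD<Integer> rdd = sparkContext.makeRDD(rddElemSeq, ct);
JavaRDD<Integer> javaRDD = JavaRDD.fromRDD(rdd, ct);
// Map each key to its partition-local paths, then load the data from those paths.
JavaRDD<List<String>> keyRdd = javaRDD.map(myFunction);
JavaRDD<myData> myDataRdd = keyRdd.map(loadMyData);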
In this snippet, rddElemSeq contains the location preferences for each partition (an IP address). Each partition also has an Integer which acts like a key. My function myFunction consumes that key and can be used to generate a list of paths to my data local to that partition. Then that data can be loaded in the next line.
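The same pattern can be sketched directly in Scala, where the location-preference overload needs no ClassTag plumbing. This is only an illustration under stated assumptions: the host addresses, the /data/part-N paths, the helper pathsFor, and the object name MakeRddWithLocality are all hypothetical, and in local mode the preferences are recorded but have no effect on scheduling.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MakeRddWithLocality {
  // Hypothetical helper: maps a partition key to the files stored on that node.
  def pathsFor(key: Int): Seq[String] =
    Seq(s"/data/part-$key/a.bin", s"/data/part-$key/b.bin")

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders chosen for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("makeRDD-locality")
    val sc = new SparkContext(conf)
    try {
      // One element per partition: (key, preferred hosts). Hosts are made up.
      val keysWithHosts: Seq[(Int, Seq[String])] = Seq(
        (0, Seq("10.0.0.1")),
        (1, Seq("10.0.0.2"))
      )

      // This overload returns an RDD[Int] with one partition per element.
      val keys = sc.makeRDD(keysWithHosts)
      assert(keys.getNumPartitions == keysWithHosts.size)

      // Each partition reports the hosts it was given as its preferred locations.
      keys.partitions.foreach { p =>
        println(s"partition ${p.index} prefers ${keys.preferredLocations(p)}")
      }

      // Expand each key into its paths; a real job would load the files here.
      val paths = keys.flatMap(pathsFor)
      paths.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Preferred locations are hints to the scheduler rather than guarantees, so a task can still run on another node when its preferred host is busy or unavailable.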