In the Spark API, what is the difference between the makeRDD function and the parallelize function?

I have a question while writing a Spark app: in the Spark API, what is the difference between the makeRDD function and the parallelize function?

asked Jul 15 '15 by Lee. YunSu


2 Answers

There is no difference whatsoever. To quote the makeRDD docstring:

This method is identical to parallelize.

and if you take a look at the implementation it simply calls parallelize:

def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}

At the end of the day it is a matter of taste. One thing to consider is that makeRDD seems to be specific to the Scala API; PySpark and the internal SparkR API provide only parallelize.
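
For illustration, a minimal spark-shell sketch (assuming the usual sc handle for the SparkContext) showing the two calls behave the same for the common case:

val data = Seq(1, 2, 3, 4, 5)
// sc is the SparkContext provided by spark-shell
val rdd1 = sc.parallelize(data, numSlices = 2)
val rdd2 = sc.makeRDD(data, numSlices = 2)
// Same contents, same number of partitions
assert(rdd1.collect().sameElements(rdd2.collect()))
assert(rdd1.getNumPartitions == rdd2.getNumPartitions)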

Note: There is a second implementation of makeRDD which allows you to set location preferences, but given its different signature it is not interchangeable with parallelize.
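
A short sketch of that overload (hostnames here are made up): it takes (element, preferred locations) pairs, has no numSlices parameter, and creates one partition per element:

// Second overload: Seq of (element, preferred hosts) pairs
val withPrefs = Seq(
  (1, Seq("host-a.example.com")),
  (2, Seq("host-b.example.com"))
)
val rdd = sc.makeRDD(withPrefs)   // RDD[Int] with one partition per pair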

answered Nov 16 '22 by zero323


As noted by @zero323, makeRDD has 2 implementations. One is identical to parallelize. The other is a very useful way to inject data locality into your Spark application even if you are not using HDFS.

For example, it provides data locality when your data is already distributed on disk across your Spark cluster according to some business logic. Assume your goal is to create an RDD that will load data from disk and transform it with a function, and you would like to do so while running local to the data as much as possible.

To do this, you can use makeRDD to create an empty RDD with different location preferences assigned to each of your RDD partitions. Each partition can be responsible for loading your data. As long as you fill the partitions with the path to your partition-local data, then execution of subsequent transformations will be node-local.

// rddElemSeq: a Scala Seq of (key, preferred host locations) pairs, built
// from a java.util.List via the JavaConversions helper
Seq<Tuple2<Integer, Seq<String>>> rddElemSeq =
        JavaConversions.asScalaBuffer(rddElemList).toSeq();
// ct is a ClassTag<Integer>, needed when calling the Scala makeRDD from Java
RDD<Integer> rdd = sparkContext.makeRDD(rddElemSeq, ct);
JavaRDD<Integer> javaRDD = JavaRDD.fromRDD(rdd, ct);
JavaRDD<List<String>> keyRdd = javaRDD.map(myFunction);   // key -> partition-local paths
JavaRDD<myData> myDataRdd = keyRdd.map(loadMyData);       // load the data from those paths

In this snippet, rddElemSeq contains the location preferences for each partition (an IP address). Each partition also carries an Integer which acts as a key. My function myFunction consumes that key and generates a list of paths to the data local to that partition; that data is then loaded on the next line.
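
As a rough check of the idea (Scala API, made-up IPs), the hints passed to makeRDD surface as each partition's preferred locations, which is what lets the subsequent transformations run node-local:

val elems = Seq((0, Seq("10.0.0.1")), (1, Seq("10.0.0.2")))
val rdd = sc.makeRDD(elems)
rdd.partitions.foreach { p =>
  // prints something like: partition 0 -> List(10.0.0.1)
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
}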

answered Nov 16 '22 by Sumit