In the Spark API, what is the difference between the makeRDD function and the parallelize function?

I have a question while writing a Spark app: in the Spark API, what is the difference between the makeRDD function and the parallelize function?

asked Jul 15 '15 by Lee. YunSu


2 Answers

There is no difference whatsoever. To quote the makeRDD docstring:

This method is identical to parallelize.

and if you take a look at the implementation it simply calls parallelize:

def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}

At the end of the day it is a matter of taste. One thing to consider is that makeRDD seems to be specific to the Scala API; PySpark and the internal SparkR API provide only parallelize.
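
For illustration, a minimal spark-shell sketch (assuming the usual sc handle for the SparkContext) showing the two calls behave the same for the common case:

val data = Seq(1, 2, 3, 4, 5)
// sc is the SparkContext provided by spark-shell
val rdd1 = sc.parallelize(data, numSlices = 2)
val rdd2 = sc.makeRDD(data, numSlices = 2)
// Same contents, same number of partitions
assert(rdd1.collect().sameElements(rdd2.collect()))
assert(rdd1.getNumPartitions == rdd2.getNumPartitions)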

Note: There is a second implementation of makeRDD which allows you to set location preferences, but given its different signature it is not interchangeable with parallelize.
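
A short sketch of that overload (hostnames here are made up): it takes (element, preferred locations) pairs, has no numSlices parameter, and creates one partition per element:

// Second overload: Seq of (element, preferred hosts) pairs
val withPrefs = Seq(
  (1, Seq("host-a.example.com")),
  (2, Seq("host-b.example.com"))
)
val rdd = sc.makeRDD(withPrefs)   // RDD[Int] with one partition per pair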

answered Nov 16 '22 by zero323


As noted by @zero323, makeRDD has 2 implementations. One is identical to parallelize. The other is a very useful way to inject data locality into your Spark application even if you are not using HDFS.

For example, it provides data locality when your data is already distributed on disk across your Spark cluster according to some business logic. Assume your goal is to create an RDD that will load data from disk and transform it with a function, and you would like to do so while running local to the data as much as possible.

To do this, you can use makeRDD to create an empty RDD with different location preferences assigned to each of your RDD partitions. Each partition can be responsible for loading your data. As long as you fill the partitions with the path to your partition-local data, then execution of subsequent transformations will be node-local.

// rddElemSeq: a Scala Seq of (key, preferred host locations) pairs, built
// from a java.util.List via the JavaConversions helper
Seq<Tuple2<Integer, Seq<String>>> rddElemSeq =
        JavaConversions.asScalaBuffer(rddElemList).toSeq();
// ct is a ClassTag<Integer>, needed when calling the Scala makeRDD from Java
RDD<Integer> rdd = sparkContext.makeRDD(rddElemSeq, ct);
JavaRDD<Integer> javaRDD = JavaRDD.fromRDD(rdd, ct);
JavaRDD<List<String>> keyRdd = javaRDD.map(myFunction);   // key -> partition-local paths
JavaRDD<myData> myDataRdd = keyRdd.map(loadMyData);       // load the data from those paths

In this snippet, rddElemSeq contains the location preferences for each partition (an IP address). Each partition also carries an Integer which acts as a key. My function myFunction consumes that key and generates a list of paths to the data local to that partition; that data is then loaded on the next line.
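
As a rough check of the idea (Scala API, made-up IPs), the hints passed to makeRDD surface as each partition's preferred locations, which is what lets the subsequent transformations run node-local:

val elems = Seq((0, Seq("10.0.0.1")), (1, Seq("10.0.0.2")))
val rdd = sc.makeRDD(elems)
rdd.partitions.foreach { p =>
  // prints something like: partition 0 -> List(10.0.0.1)
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
}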

answered Nov 16 '22 by Sumit