
Performance Impact of RDD to JavaRDD conversion

I have code like the following, and I want to work with a JavaRDD instead of an RDD, so I convert it here. What is the performance impact of this conversion, especially when dealing with gigabytes of data?

RDD<String> textFile = sc.textFile(filePath, 2);
JavaRDD<String> javaRDD = textFile.toJavaRDD(); 

Is this a wide or a narrow transformation? What is the difference between JavaRDD and RDD?

Asked May 28 '16 09:05 by Balaji Reddy




1 Answer

There's no significant performance penalty: JavaRDD is a thin wrapper around RDD that exists only to make calls from Java code more convenient. It holds the original RDD as a member and delegates every method call to it, for example (from JavaRDD.scala):

def cache(): JavaRDD[T] = wrapRDD(rdd.cache()) 

wrapRDD boils down to something like new JavaRDD[T](rdd), so the only cost is allocating a thin wrapper object on each method invocation. That is entirely negligible, because it happens once per call on the RDD as a whole, not once per element. For the same reason, toJavaRDD() is neither a narrow nor a wide transformation: it adds no new RDD to the lineage and moves no data.
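The delegation pattern described above can be sketched in plain Java without Spark. This is only a toy model: ToyRdd and ToyJavaRdd are hypothetical stand-ins invented for illustration, not Spark classes, and the assumption that cache() returns the same instance is a simplification. The point it tries to show is that wrapping stores a reference and never iterates over the data.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: ToyRdd and ToyJavaRdd are hypothetical stand-ins, not Spark classes.
class WrapperSketch {

    // Stand-in for the underlying RDD that owns the data.
    static final class ToyRdd<T> {
        private final List<T> elements;

        ToyRdd(List<T> elements) {
            this.elements = elements;
        }

        // Assumption for this sketch: caching hands back the same instance.
        ToyRdd<T> cache() {
            return this;
        }

        long count() {
            return elements.size();
        }

        // Wrapping allocates one small object and never touches the elements.
        ToyJavaRdd<T> toJavaRDD() {
            return new ToyJavaRdd<>(this);
        }
    }

    // Stand-in for the Java-friendly wrapper: one field, pure delegation.
    static final class ToyJavaRdd<T> {
        private final ToyRdd<T> rdd;

        ToyJavaRdd(ToyRdd<T> rdd) {
            this.rdd = rdd;
        }

        // Exposes the wrapped object so callers can get back to it.
        ToyRdd<T> rdd() {
            return rdd;
        }

        // Mirrors the shape of wrapRDD(rdd.cache()): delegate, then re-wrap.
        ToyJavaRdd<T> cache() {
            return new ToyJavaRdd<>(rdd.cache());
        }

        long count() {
            return rdd.count();
        }
    }

    public static void main(String[] args) {
        ToyRdd<String> rdd = new ToyRdd<>(Arrays.asList("a", "b", "c"));
        ToyJavaRdd<String> javaRdd = rdd.toJavaRDD();

        System.out.println(javaRdd.rdd() == rdd);         // true: same underlying object
        System.out.println(javaRdd.cache().rdd() == rdd); // true: new wrapper, same data
        System.out.println(javaRdd.count());              // 3
    }
}
```

Each wrapper method call here creates at most one short-lived object regardless of how many elements the wrapped collection holds, which is why the cost does not grow with data size.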

Answered Sep 22 '22 08:09 by Tzach Zohar