Difference and use-cases of RDD and Pair RDD

People also ask

What is a pair RDD, and when should you use one?

A pair RDD is a distributed collection of key-value pairs. It is a special case of a Resilient Distributed Dataset (RDD), so it has all the features of an RDD plus additional operations that only make sense for key-value data. Many transformations, such as reduceByKey, groupByKey, and join, are available only on pair RDDs.
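As a minimal sketch of the idea above, the following Scala program turns an ordinary RDD[String] into a pair RDD and aggregates it by key. The object name PairRddSketch, the helper countWords, and the sample words are made up for illustration, and the local master setting is an assumption for running on a single machine. In the Scala API, the key-value operations are provided by the PairRDDFunctions class, which becomes available through an implicit conversion whenever the element type of the RDD is a two-element tuple.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; the names are illustrative, not from the original post.
object PairRddSketch {

  // Counts occurrences of each word using a pair RDD.
  def countWords(sc: SparkContext, words: Seq[String]): Map[String, Int] = {
    // An ordinary RDD: the elements are plain strings.
    val wordsRdd: RDD[String] = sc.parallelize(words)

    // Mapping each element to a tuple gives an RDD[(String, Int)], i.e. a pair RDD.
    val pairs: RDD[(String, Int)] = wordsRdd.map(word => (word, 1))

    // reduceByKey exists only for key-value RDDs; it merges the values of each key.
    val counts: RDD[(String, Int)] = pairs.reduceByKey(_ + _)

    // collect() is an action that brings the result back to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val result = countWords(sc, Seq("spark", "rdd", "spark", "pair", "rdd", "spark"))
      result.foreach { case (word, n) => println(s"$word -> $n") }
    } finally {
      sc.stop()
    }
  }
}
```

Calling reduceByKey on the plain RDD[String] would not compile, which is the practical difference between the two kinds of RDD: the data structure is the same, but the tuple element type unlocks the by-key operations.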

For what kinds of tasks are pair RDDs preferred over basic RDDs?

Pair RDDs are useful because they allow us to act on each key in parallel or to regroup data across the network. A pair RDD can be created from an existing regular RDD, for example by applying the map operation to it so that each element becomes a (key, value) tuple: val rdd: RDD[WikipediaPage] = ...
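The conversion described above can be sketched as follows. The Page case class and its fields are a hypothetical stand-in for a record type (they are not the definition of the WikipediaPage type mentioned above), and the object and function names are likewise made up for illustration. The sketch keys each record by one of its fields, then shows one operation that regroups data by key and one that aggregates per key.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type used only for this illustration.
final case class Page(title: String, language: String)

object KeyedPagesSketch {

  // Regroups page titles by language using groupByKey.
  def titlesByLanguage(sc: SparkContext, pages: Seq[Page]): Map[String, Set[String]] = {
    // A regular RDD of records.
    val pagesRdd: RDD[Page] = sc.parallelize(pages)

    // map turns each record into a (key, value) tuple, producing a pair RDD.
    val byLanguage: RDD[(String, String)] =
      pagesRdd.map(page => (page.language, page.title))

    // groupByKey shuffles records so that all values of a key end up together.
    val grouped: RDD[(String, Iterable[String])] = byLanguage.groupByKey()

    grouped.mapValues(_.toSet).collect().toMap
  }

  // Counts pages per language; each key is aggregated independently.
  def pagesPerLanguage(sc: SparkContext, pages: Seq[Page]): Map[String, Int] =
    sc.parallelize(pages)
      .map(page => (page.language, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val sc = new SparkContext(
      new SparkConf().setAppName("keyed-pages-sketch").setMaster("local[*]"))
    try {
      val pages = Seq(Page("Apache Spark", "en"), Page("Scala", "en"), Page("Apache Spark", "de"))
      println(titlesByLanguage(sc, pages))
      println(pagesPerLanguage(sc, pages))
    } finally {
      sc.stop()
    }
  }
}
```

When only an aggregate per key is needed, reduceByKey is usually the better choice of the two, because it combines values within each partition before shuffling, whereas groupByKey sends every value across the network.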

What are the two types of RDD operations?

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
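A small sketch may make the distinction concrete. The object and function names below are hypothetical and the local master setting is an assumption. The filter and map calls are transformations: they only describe new RDDs and are evaluated lazily. The reduce and count calls are actions: they trigger the computation and return plain values to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; the names are illustrative.
object TransformationsVsActionsSketch {

  // Builds the RDD of squares of the even numbers in 1..upTo.
  // Only transformations are used here, so nothing is computed yet.
  def evenSquares(sc: SparkContext, upTo: Int): RDD[Int] = {
    val numbers: RDD[Int] = sc.parallelize(1 to upTo)
    val evens: RDD[Int] = numbers.filter(_ % 2 == 0) // transformation
    evens.map(n => n * n)                            // transformation
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val sc = new SparkContext(
      new SparkConf().setAppName("ops-sketch").setMaster("local[*]"))
    try {
      val squares = evenSquares(sc, 10)

      // Actions: these run the job and return values to the driver.
      val total: Int = squares.reduce(_ + _) // 4 + 16 + 36 + 64 + 100 = 220
      val howMany: Long = squares.count()    // 5 even numbers in 1..10

      println(s"sum = $total, count = $howMany")
    } finally {
      sc.stop()
    }
  }
}
```

Because each action re-evaluates the chain of transformations, an RDD that feeds several actions, as squares does here, can be persisted with cache() to avoid recomputation.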

What are the limitations of RDDs?

RDDs have some drawbacks. RDD code can be opaque, so developers may struggle to work out what a piece of code is actually computing. Spark also cannot optimize RDD computations automatically, because it cannot look inside the lambda functions passed to RDD operations.


I am new to Spark and am trying to understand the difference between a normal RDD and a pair RDD. What are the use cases where a pair RDD is used instead of a normal RDD? If possible, I would like to understand the internals of a pair RDD with an example. Thanks.