A pair RDD is a distributed collection of data organized as key-value pairs. It is a specialization of the Resilient Distributed Dataset (RDD), so it has all the features of a regular RDD plus additional operations that work on keys and values. Many transformation operations are available specifically for pair RDDs.
They are useful because they let us act on each key in parallel or regroup data across the network. Pair RDDs can be created from existing regular RDDs, for example by applying a map operation that emits tuples to a regular RDD such as: val rdd: RDD[WikipediaPage] = ...
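As a minimal sketch of that creation step: the `WikipediaPage` case class below (with `title` and `text` fields) and the sample data are assumptions, since the original snippet only shows the type name; `sc` is an existing SparkContext.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical page type; the original only names it, so the fields are assumed.
case class WikipediaPage(title: String, text: String)

val rdd: RDD[WikipediaPage] = sc.parallelize(Seq(
  WikipediaPage("Scala", "Scala is a JVM language ..."),
  WikipediaPage("Spark", "Spark is a cluster computing engine ...")
))

// Mapping each element to a (key, value) tuple yields a pair RDD.
// Spark then makes the key-value operations (reduceByKey, join, ...)
// available on it via an implicit conversion to PairRDDFunctions.
val pairRdd: RDD[(String, String)] = rdd.map(page => (page.title, page.text))
```

Nothing special is declared: any `RDD` whose elements are two-element tuples is treated as a pair RDD.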
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
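The distinction can be sketched in a few lines, again assuming an existing SparkContext `sc`:

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// Transformation: lazily describes a new dataset; nothing executes yet.
val doubled = nums.map(_ * 2)

// Action: triggers the actual computation across the cluster and
// returns a plain value to the driver program.
val total = doubled.reduce(_ + _)
```

Because transformations are lazy, Spark only runs the pipeline when an action such as `reduce`, `collect`, or `count` is called.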
There are some drawbacks to using RDDs, though. RDD code can be opaque: developers may struggle to work out what the code is actually computing. Spark also cannot optimize RDD operations, because it cannot look inside the lambda functions passed to them.
I am new to Spark and trying to understand the difference between a normal RDD and a pair RDD. What are the use cases where a pair RDD is used as opposed to a normal RDD? If possible, I want to understand the internals of a pair RDD with an example. Thanks.
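A classic use case is word count, sketched below assuming a SparkContext `sc` and made-up input lines. It shows the point where a regular RDD becomes a pair RDD, and why that matters:

```scala
val lines = sc.parallelize(Seq("spark makes pair rdds", "pair rdds group by key"))

val counts = lines
  .flatMap(_.split(" "))   // regular RDD[String]: no notion of a key
  .map(word => (word, 1))  // pair RDD[(String, Int)]: each word becomes a key
  .reduceByKey(_ + _)      // key-aware transformation: sums the values per key

counts.collect()           // e.g. ("pair", 2), ("rdds", 2), ("spark", 1), ...
```

Operations like `reduceByKey`, `groupByKey`, and `join` exist only on pair RDDs, because a regular RDD has no key to group, aggregate, or join on.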