I want to access data from a particular partition in Spark RDD. I can get address of a partition as follow:
myRDD.partitions(0)
But I want to get data from myRDD.partitions(0)
partition.
I tried official org.apache.spark documentation but couldn't find.
Thanks in advance.
Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in spark is an atomic chunk of data (logical division of data) stored on a node in the cluster. Partitions are basic units of parallelism in Apache Spark.
Apache Spark's Resilient Distributed Datasets (RDD) are a collection of various data that are so big in size, that they cannot fit into a single node and should be partitioned across various nodes. Apache Spark automatically partitions RDDs and distributes the partitions across different nodes.
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
You can use mapPartitionsWithIndex
as follows
// Create (1, 1), (2, 2), ..., (100, 100) dataset
// and partition by key so we know what to expect
val rdd = sc.parallelize((1 to 100) map (i => (i, i)), 16)
.partitionBy(new org.apache.spark.HashPartitioner(8))
val zeroth = rdd
// If partition number is not zero ignore data
.mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator())
// Check if we get expected results 8, 16, ..., 96
assert (zeroth.keys.map(_ % 8 == 0).reduce(_ & _) & zeroth.count == 12)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With