 

How to get data from a specific partition in Spark RDD?

I want to access data from a particular partition of a Spark RDD. I can get a handle on a partition as follows:

myRDD.partitions(0)

But I want to get the data held in the myRDD.partitions(0) partition. I looked through the official org.apache.spark documentation but couldn't find anything.

Thanks in advance.

asked Sep 11 '15 by Vikash Pareek

People also ask

How is the data in these RDDs partitioned by default?

Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark.

Can RDD be partitioned?

Apache Spark's Resilient Distributed Datasets (RDDs) are collections of data that are so large they cannot fit on a single node and must be partitioned across several nodes. Apache Spark automatically partitions RDDs and distributes the partitions across different nodes.
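The placement rule behind that partitioning can be sketched without a cluster: a HashPartitioner sends a key to partition number `key.hashCode` modulo the partition count, made non-negative. The helper below mirrors Spark's internal `Utils.nonNegativeMod` logic in plain Scala (it is a stand-in for illustration, not a call into the Spark API):

```scala
// Mirror of HashPartitioner's placement rule: partition = nonNegativeMod(key.hashCode, n)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

def partitionFor(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

// For Int keys, hashCode is the value itself, so with 8 partitions
// partition 0 receives exactly the multiples of 8.
val inZero = (1 to 100).filter(k => partitionFor(k, 8) == 0)
println(inZero.mkString(", "))  // 8, 16, 24, ..., 96
```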

How many partitions should a Spark RDD have?

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
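Applying that rule of thumb is a one-liner. The sketch below uses the driver JVM's core count as a stand-in for the cluster's total cores (an assumption for illustration; on a real cluster you would use the number of executor cores):

```scala
// Rule of thumb: 2-4 partitions per CPU core; pick the middle of the range.
val cores = Runtime.getRuntime.availableProcessors  // stand-in for total cluster cores
val numPartitions = cores * 3
// sc.parallelize(data, numPartitions) would then cut the collection into that many slices
```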


1 Answer

You can use mapPartitionsWithIndex as follows:

// Create (1, 1), (2, 2), ..., (100, 100) dataset
// and partition by key so we know what to expect
val rdd = sc.parallelize((1 to 100) map (i => (i, i)), 16)
  .partitionBy(new org.apache.spark.HashPartitioner(8))

val zeroth = rdd
  // Keep data only for partition number zero, drop the rest
  .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator.empty)

// Check that we get the expected keys 8, 16, ..., 96
assert(zeroth.keys.collect.forall(_ % 8 == 0) && zeroth.count == 12)
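If you only need one partition's contents on the driver, `SparkContext.runJob` also accepts the set of partition ids to compute, so Spark will not even schedule tasks for the other partitions. A self-contained sketch in local mode, using the same dataset as above:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("fetch-one-partition"))

val rdd = sc.parallelize((1 to 100).map(i => (i, i)), 16)
  .partitionBy(new HashPartitioner(8))

// runJob takes the partition ids to run on; only partition 0 is computed here
val zerothData: Array[(Int, Int)] =
  sc.runJob(rdd, (iter: Iterator[(Int, Int)]) => iter.toArray, Seq(0)).head

println(zerothData.map(_._1).sorted.mkString(", "))  // 8, 16, ..., 96
sc.stop()
```

Unlike the mapPartitionsWithIndex approach, this avoids launching tasks for the partitions you are going to discard, at the cost of collecting the result to the driver.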
answered Oct 12 '22 by zero323