 

How to get data from a specific partition in Spark RDD?

I want to access data from a particular partition of a Spark RDD. I can get a handle on a partition as follows:

myRDD.partitions(0)

But I want to get the data held in the myRDD.partitions(0) partition. I looked through the official org.apache.spark documentation but couldn't find anything.

Thanks in advance.

asked Sep 11 '15 by Vikash Pareek

People also ask

How is the data in these RDDs partitioned by default?

Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark.

Can RDD be partitioned?

Apache Spark's Resilient Distributed Datasets (RDDs) are collections of data that are so large they cannot fit on a single node and must be partitioned across several nodes. Apache Spark automatically partitions RDDs and distributes the partitions across different nodes.
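The placement rule behind that partitioning can be sketched without a cluster: a HashPartitioner sends a key to partition number `key.hashCode` modulo the partition count, made non-negative. The helper below mirrors Spark's internal `Utils.nonNegativeMod` logic in plain Scala (it is a stand-in for illustration, not a call into the Spark API):

```scala
// Mirror of HashPartitioner's placement rule: partition = nonNegativeMod(key.hashCode, n)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

def partitionFor(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

// For Int keys, hashCode is the value itself, so with 8 partitions
// partition 0 receives exactly the multiples of 8.
val inZero = (1 to 100).filter(k => partitionFor(k, 8) == 0)
println(inZero.mkString(", "))  // 8, 16, 24, ..., 96
```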

How many partitions should a Spark RDD have?

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
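Applying that rule of thumb is a one-liner. The sketch below uses the driver JVM's core count as a stand-in for the cluster's total cores (an assumption for illustration; on a real cluster you would use the number of executor cores):

```scala
// Rule of thumb: 2-4 partitions per CPU core; pick the middle of the range.
val cores = Runtime.getRuntime.availableProcessors  // stand-in for total cluster cores
val numPartitions = cores * 3
// sc.parallelize(data, numPartitions) would then cut the collection into that many slices
```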


1 Answer

You can use mapPartitionsWithIndex as follows:

// Create (1, 1), (2, 2), ..., (100, 100) dataset
// and partition by key so we know what to expect
val rdd = sc.parallelize((1 to 100) map (i => (i, i)), 16)
  .partitionBy(new org.apache.spark.HashPartitioner(8))

val zeroth = rdd
  // Keep data only for partition number zero, drop the rest
  .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator.empty)

// Check that we get the expected keys 8, 16, ..., 96
assert(zeroth.keys.collect.forall(_ % 8 == 0) && zeroth.count == 12)
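If you only need one partition's contents on the driver, `SparkContext.runJob` also accepts the set of partition ids to compute, so Spark will not even schedule tasks for the other partitions. A self-contained sketch in local mode, using the same dataset as above:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("fetch-one-partition"))

val rdd = sc.parallelize((1 to 100).map(i => (i, i)), 16)
  .partitionBy(new HashPartitioner(8))

// runJob takes the partition ids to run on; only partition 0 is computed here
val zerothData: Array[(Int, Int)] =
  sc.runJob(rdd, (iter: Iterator[(Int, Int)]) => iter.toArray, Seq(0)).head

println(zerothData.map(_._1).sorted.mkString(", "))  // 8, 16, ..., 96
sc.stop()
```

Unlike the mapPartitionsWithIndex approach, this avoids launching tasks for the partitions you are going to discard, at the cost of collecting the result to the driver.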
answered Oct 12 '22 by zero323