
Spark foreachPartition, how to get the index of each partition?

In Spark's foreachPartition, how can I get the index of the current partition (or a sequence number, or anything else that identifies the partition)?

val docs: RDD[String] = ...

println("num partitions: " + docs.getNumPartitions)

docs.foreachPartition((it: Iterator[String]) => {
  println("partition index: " + ???)
  it.foreach(...)
})
asked Jan 22 '18 by David Portabella
People also ask

How Spark decides number of partitions?

The number of partitions in Spark should be chosen thoughtfully, based on the cluster configuration and the requirements of the application. Increasing the number of partitions means each partition holds less data, or possibly no data at all.
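For example, a minimal sketch (assuming the sc provided by spark-shell started with --master local[4], so the default parallelism is 4):

val rdd = sc.parallelize(1 to 100)
println(rdd.getNumPartitions)  // 4: one partition per local core by default

// more partitions mean less data per partition (some may even be empty)
println(rdd.repartition(200).getNumPartitions)  // 200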

What is foreach in Spark?

In Spark, foreach() is an action operation, available on RDD, DataFrame, and Dataset, that iterates over each element in the dataset. It is similar to a for loop, except that the function runs on the executors rather than on the driver.
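A minimal sketch (again assuming spark-shell's sc); note that on a cluster the println output appears in the executor logs, not on the driver console:

sc.parallelize(Seq("a", "b", "c"))
  .foreach(x => println(x))  // runs once per element, on the executors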

How do I change the number of partitions in a Spark data frame?

If you want to change the number of partitions of your DataFrame, all you need to run is the repartition() function: it returns a new DataFrame partitioned by the given partitioning expressions, and the resulting DataFrame is hash partitioned.
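A minimal sketch (assuming the spark session provided by spark-shell):

val df = spark.range(1000).toDF("id")
println(df.rdd.getNumPartitions)        // depends on the default parallelism

val df8 = df.repartition(8)             // new, hash-partitioned DataFrame
println(df8.rdd.getNumPartitions)       // 8

val byId = df.repartition(8, df("id"))  // hash partitioned on the id column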


1 Answer

You can use TaskContext (How to get ID of a map task in Spark?):

import org.apache.spark.TaskContext

rdd.foreachPartition((it: Iterator[String]) => {
  println(TaskContext.getPartitionId)
})
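Applied to the snippet from the question, a complete sketch looks like this (the sample data and partition count are made up for illustration):

import org.apache.spark.TaskContext

val docs = sc.parallelize(Seq("doc1", "doc2", "doc3", "doc4"), 2)
println("num partitions: " + docs.getNumPartitions)  // 2

docs.foreachPartition((it: Iterator[String]) => {
  // TaskContext.getPartitionId returns the index of the partition
  // processed by the current task
  println("partition index: " + TaskContext.getPartitionId)
  it.foreach(doc => println(doc))
})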
answered Sep 27 '22 by Alper t. Turker