Spark foreachPartition: how to get the index of the partition (or a sequence number, or something else to identify the partition)?
val docs: RDD[String] = ...
println("num partitions: " + docs.getNumPartitions)
docs.foreachPartition((it: Iterator[String]) => {
  println("partition index: " + ???)
  it.foreach(...)
})
The number of partitions in Spark should be chosen deliberately, based on the cluster configuration and the requirements of the application. Increasing the number of partitions means each partition holds less data, and some partitions may end up empty.
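For example (a minimal sketch, assuming a spark-shell session where sc is the SparkContext), repartitioning four elements into eight partitions leaves some of them empty:

val small = sc.parallelize(1 to 4).repartition(8)
small
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))   // (partition index, record count)
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx holds $n record(s)") }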
In Spark, foreach() is an action operation, available on RDD, DataFrame, and Dataset, that iterates over each element of the dataset. It is similar to a for loop, except that the function is executed in a distributed fashion on the executors.
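A minimal sketch (again assuming sc from spark-shell); in local mode the println output appears on the driver console, while on a cluster it ends up in the executors' stdout logs:

val words = sc.parallelize(Seq("spark", "foreach", "action"))
words.foreach(word => println(word))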
If you want to increase the number of partitions of a DataFrame, you can call the repartition() function. It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is hash partitioned.
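A hedged sketch of that (assuming spark is a SparkSession and using a made-up column name):

import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")
println(df.rdd.getNumPartitions)
val repartitioned = df.repartition(8, $"key")   // hash partitioned by "key"
println(repartitioned.rdd.getNumPartitions)     // 8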
You can use TaskContext (see the related question "How to get ID of a map task in Spark?"):
import org.apache.spark.TaskContext
rdd.foreachPartition((it: Iterator[String]) => {
  println(TaskContext.getPartitionId())
})
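Applied to the original example, that looks roughly like this (a sketch; docs stands for whatever RDD[String] you have):

import org.apache.spark.TaskContext

docs.foreachPartition((it: Iterator[String]) => {
  val idx = TaskContext.getPartitionId()   // index of the partition this task is processing
  it.foreach(line => println(s"partition $idx: $line"))
})

The same value is also available via TaskContext.get().partitionId().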