
Find out the partition number/ID

Tags:

apache-spark

Is there a way (a method) in Spark to find out the partition ID/number?

Take this example:

val input1 = sc.parallelize(List(8, 9, 10), 3)

val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}

I would like to put some code in place of ??? to print the partition ID/number.

asked Jul 07 '15 by Raj



4 Answers

I ran across this old question while looking for the spark_partition_id SQL function for DataFrames.

import spark.implicits._                                   // for .toDF on an RDD
import org.apache.spark.sql.functions.spark_partition_id

val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id", spark_partition_id()).rdd.collect

res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
answered by Jeremy


Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce, of course, but you could combine it with aggregate, as sketched below.)
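
A minimal sketch of that combination, assuming the same input1 RDD as in the question (the name sum is just illustrative): each element is tagged with its partition index via mapPartitionsWithIndex, and the total is then folded with aggregate, mimicking the original reduce.

val input1 = sc.parallelize(List(8, 9, 10), 3)

val sum = input1
  .mapPartitionsWithIndex { (index, itr) =>
    // print the partition index for each element, then pass the value through
    itr.map { x =>
      println("Inside partition " + index)
      x
    }
  }
  .aggregate(0)(_ + _, _ + _)   // seqOp sums within a partition, combOp merges the partials; sum == 27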

like image 27
Holden Avatar answered Oct 21 '22 07:10

Holden


You can also use

TaskContext.getPartitionId()

e.g., in lieu of the presently missing foreachPartitionWithIndex()

https://github.com/apache/spark/pull/5927#issuecomment-99697229
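
For instance, here is a minimal sketch that plugs it into the question's reduce (assuming the same input1 RDD from the question):

import org.apache.spark.TaskContext

val input1 = sc.parallelize(List(8, 9, 10), 3)

val res = input1.reduce { (x, y) =>
  // getPartitionId() returns the partition ID of the currently running task,
  // or 0 if no TaskContext is active (e.g. on the driver)
  println("Inside partition " + TaskContext.getPartitionId())
  x + y
}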

answered by steamer25


Posting the answer here using mapPartitionsWithIndex, based on the suggestion by @Holden.

I have created an RDD (input) with 3 partitions. The elements of input are tagged with the partition index (index) in the call to mapPartitionsWithIndex:

scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21

scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)
answered by Raj