Is there a way (a method) in Spark to find out the partition ID/number?
Take this example here:

val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}

I would like to put some code in ??? to print the partition ID/number.
I ran across this old question while looking for the spark_partition_id SQL function for DataFrame.
import org.apache.spark.sql.functions.spark_partition_id // already in scope in spark-shell

val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id", spark_partition_id).rdd.collect
res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
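As a small follow-on sketch (assuming the same input RDD and a spark-shell session, so the implicits are in scope), spark_partition_id also makes it easy to count how many rows land in each partition:

input.toDF("value")
  .groupBy(spark_partition_id().as("partition"))
  .count()
  .orderBy("partition")
  .show()
// Partitions 0 and 1 hold 2 rows each, partition 2 holds 3, matching res7 above.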
Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce, of course, but you could combine the result of that with a reduce or aggregate, as sketched below.)
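A minimal sketch of that combination, tagging each element with its partition index before summing (the message text is just illustrative):

val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1
  .mapPartitionsWithIndex { (index, itr) =>
    // Print the partition index for each element, then pass the element through.
    itr.map { x => println("Inside partition " + index); x }
  }
  .reduce(_ + _) // res = 27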
You can also use TaskContext.getPartitionId(), e.g., in lieu of the presently missing foreachPartitionWithIndex():
https://github.com/apache/spark/pull/5927#issuecomment-99697229
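Applied to the question's snippet, a sketch filling in the ???:

import org.apache.spark.TaskContext

val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce { (x, y) =>
  // getPartitionId() reports the partition of the currently running task;
  // the final merge of per-partition results runs on the driver, where it reports 0.
  println("Inside partition " + TaskContext.getPartitionId())
  x + y
}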
Posting the answer here using mapPartitionsWithIndex, based on the suggestion by @Holden.

I have created an RDD (input) with 3 partitions. Each element in input is tagged with its partition index (index) in the call to mapPartitionsWithIndex:
scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21
scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)