
Find out the partition number/ID

Tags:

apache-spark

Is there a way (a method) in Spark to find out the partition ID/number?

Take this example:

val input1 = sc.parallelize(List(8, 9, 10), 3)

val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}

I would like to put some code in place of ??? to print the partition ID/number.

asked Jul 07 '15 by Raj



4 Answers

I ran across this old question while looking for the spark_partition_id SQL function for DataFrames.

import spark.implicits._                                   // for .toDF on an RDD
import org.apache.spark.sql.functions.spark_partition_id

val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id", spark_partition_id()).rdd.collect

res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
answered by Jeremy


Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce, of course, but you could combine it with aggregate, as sketched below.)
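
A minimal sketch of that combination, assuming the same input1 RDD as in the question (the name sum is just illustrative): each element is tagged with its partition index via mapPartitionsWithIndex, and the total is then folded with aggregate, mimicking the original reduce.

val input1 = sc.parallelize(List(8, 9, 10), 3)

val sum = input1
  .mapPartitionsWithIndex { (index, itr) =>
    // print the partition index for each element, then pass the value through
    itr.map { x =>
      println("Inside partition " + index)
      x
    }
  }
  .aggregate(0)(_ + _, _ + _)   // seqOp sums within a partition, combOp merges the partials; sum == 27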

like image 27
Holden Avatar answered Oct 21 '22 07:10

Holden


You can also use

TaskContext.getPartitionId()

e.g., in lieu of the presently missing foreachPartitionWithIndex()

https://github.com/apache/spark/pull/5927#issuecomment-99697229
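
For instance, here is a minimal sketch that plugs it into the question's reduce (assuming the same input1 RDD from the question):

import org.apache.spark.TaskContext

val input1 = sc.parallelize(List(8, 9, 10), 3)

val res = input1.reduce { (x, y) =>
  // getPartitionId() returns the partition ID of the currently running task,
  // or 0 if no TaskContext is active (e.g. on the driver)
  println("Inside partition " + TaskContext.getPartitionId())
  x + y
}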

answered by steamer25


Posting the answer here using mapPartitionsWithIndex, based on the suggestion by @Holden.

I have created an RDD (input) with 3 partitions. The elements of input are tagged with the partition index (index) in the call to mapPartitionsWithIndex:

scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21

scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)
answered by Raj