Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there's a getNumPartitions() method.)
Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD class, so to use it with a DataFrame you first need to convert it to an RDD.
➠ getNumPartitions: the RDD function getNumPartitions can be used to get the number of partitions of a DataFrame. Example: the DataFrame "df" is converted to an RDD via its rdd attribute, and getNumPartitions() is then called on it to get the number of partitions.
You need to call getNumPartitions() on the DataFrame's underlying RDD, e.g., df.rdd.getNumPartitions(). In the case of Scala, this is a parameterless method: df.rdd.getNumPartitions.
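For instance, a minimal sketch (assuming a Spark 2.x SparkSession named spark; on Spark 1.6 you would build the DataFrame from a SQLContext instead, but the .rdd call is identical):

import org.apache.spark.sql.SparkSession

// build any small DataFrame and ask its underlying RDD for the partition count
val spark = SparkSession.builder().appName("partition-count").master("local[4]").getOrCreate()
val df = spark.range(100).toDF("id")   // illustrative data only
println(df.rdd.getNumPartitions)       // parameterless in Scala; df.rdd.getNumPartitions() in PySpark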
dataframe.rdd.partitions.size is another alternative, apart from df.rdd.getNumPartitions() or df.rdd.partitions.length.
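A quick comparison, as a sketch (assumes df is any DataFrame already in scope, e.g. in spark-shell); all three expressions return the same Int:

df.rdd.getNumPartitions    // method on the underlying RDD
df.rdd.partitions.size     // size of the Array[Partition]
df.rdd.partitions.length   // same thing via Array.length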
Let me explain this with a full example...
val x = (1 to 10).toList
val numberDF = x.toDF("number")
numberDF.rdd.partitions.size // => 4
To prove how many partitions we got with the above, save that DataFrame as CSV:
numberDF.write.csv("/Users/Ram.Ghadiyaram/output/numbers")
Here is how the data is separated across the different partitions:
Partition 00000: 1, 2
Partition 00001: 3, 4, 5
Partition 00002: 6, 7
Partition 00003: 8, 9, 10
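You can also inspect the layout without writing files; a small sketch using mapPartitionsWithIndex (reuses the numberDF from above; the label format is just illustrative):

// print which value landed in which partition
numberDF.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    rows.map(row => s"Partition $idx: ${row.getInt(0)}")
  }
  .collect()
  .foreach(println)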
@Hemanth asked a good question in the comments: basically, why is the number of partitions 4 in the case above?
Short answer: it depends on where you execute. Since I used local[4], I got 4 partitions.
Long answer:
I was running the above program on my local machine with the master set to local[4], and based on that it was using 4 partitions.
val spark = SparkSession.builder()
  .appName(this.getClass.getName)
  .config("spark.master", "local[4]")
  .getOrCreate()
When it's a spark-shell with master yarn, I got the number of partitions as 2.
Example: spark-shell --master yarn
and typed the same commands again:
scala> val x = (1 to 10).toList
x: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val numberDF = x.toDF("number")
numberDF: org.apache.spark.sql.DataFrame = [number: int]

scala> numberDF.rdd.partitions.size
res0: Int = 2
When you give --master local, the parallelism is based on Runtime.getRuntime.availableProcessors(), i.e. local[Runtime.getRuntime.availableProcessors()], and Spark will try to allocate that many partitions. If your available number of processors is 12 but you only have a list of 1 to 10, then only 10 partitions will be created.

NOTE: On the 12-core laptop where I am executing the Spark program, the default number of partitions/tasks is the number of available cores, i.e. 12; that is, local[*] or s"local[${Runtime.getRuntime.availableProcessors()}]". But in this case there are only 10 numbers, so it is limited to 10.
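You can check what your session would use as the default before creating any data; a small sketch (the app name and master URL are just examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-count-demo")   // hypothetical app name
  .master("local[*]")                // e.g. all available cores
  .getOrCreate()
import spark.implicits._

// default parallelism, e.g. 12 on a 12-core laptop
println(spark.sparkContext.defaultParallelism)

// with only 10 elements the DataFrame ends up with at most 10 partitions, as described above
val small = (1 to 10).toDF("number")
println(small.rdd.getNumPartitions)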
Keeping all these pointers in mind, I would suggest you try it on your own.