Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame Javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there is a getNumPartitions() method.)
Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD class, so to use it with a DataFrame you first need to convert the DataFrame to an RDD.
➠ getNumPartitions: The RDD function getNumPartitions can be used to get the number of partitions of a DataFrame. Example 1: the DataFrame "df" was converted to an RDD using the rdd attribute, and then getNumPartitions was applied to it to get the number of partitions.
You need to call getNumPartitions() on the DataFrame's underlying RDD, e.g., df.rdd.getNumPartitions(). In the case of Scala, this is a parameterless method: df.rdd.getNumPartitions.
dataframe.rdd.partitions.size is another alternative, apart from df.rdd.getNumPartitions() or df.rdd.partitions.length.
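A minimal, self-contained sketch putting these options together (the SparkSession setup, app name, and column name are assumptions for illustration; df.rdd.getNumPartitions() is also the PySpark form):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-count-example")   // illustrative app name
  .master("local[4]")                   // 4 local cores => 4 default partitions here
  .getOrCreate()
import spark.implicits._

val df = (1 to 10).toDF("number")

// Three equivalent ways to read the partition count of the underlying RDD:
println(df.rdd.getNumPartitions)   // parameterless in Scala; df.rdd.getNumPartitions() in PySpark
println(df.rdd.partitions.size)    // via the partitions array
println(df.rdd.partitions.length)  // same as above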
Let me explain this with a full example...
val x = (1 to 10).toList
val numberDF = x.toDF("number")
numberDF.rdd.partitions.size // => 4

To prove how many partitions we got with the above, save that DataFrame as CSV:
numberDF.write.csv("/Users/Ram.Ghadiyaram/output/numbers")

Here is how the data is separated across the different partitions:
Partition 00000: 1, 2
Partition 00001: 3, 4, 5
Partition 00002: 6, 7
Partition 00003: 8, 9, 10

@Hemanth asked a good question in the comments: basically, why is the number of partitions 4 in the above case?
Short answer: it depends on where you are executing. Since I used local[4], I got 4 partitions.
Long answer:

I was running the above program on my local machine with the master set to local[4], and based on that it used 4 partitions.
val spark = SparkSession.builder()
  .appName(this.getClass.getName)
  .config("spark.master", "local[4]")
  .getOrCreate()

If it is spark-shell with master yarn, I got the number of partitions as 2.
Example: spark-shell --master yarn, then type the same commands again:
scala> val x = (1 to 10).toList
x: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val numberDF = x.toDF("number")
numberDF: org.apache.spark.sql.DataFrame = [number: int]

scala> numberDF.rdd.partitions.size
res0: Int = 2

With --master local, based on your Runtime.getRuntime.availableProcessors(), i.e. local[Runtime.getRuntime.availableProcessors()], Spark will try to allocate that number of partitions. If your available number of processors is 12 and you have a list of 1 to 10, only 10 partitions will be created.

NOTE:
If you are on a 12-core laptop like the one where I am executing the Spark program, then by default the number of partitions/tasks is the number of all available cores, i.e. 12; that means local[*] or s"local[${Runtime.getRuntime.availableProcessors()}]". But in this case only 10 numbers are there, so it will be limited to 10.
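A small sketch (not part of the answer above) that makes this relationship visible; spark.sparkContext.defaultParallelism is a standard API reporting the default partition count for the chosen master, and the app name here is just an illustrative assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("default-parallelism-check")  // illustrative app name
  .master(s"local[${Runtime.getRuntime.availableProcessors()}]")  // same as local[*]
  .getOrCreate()
import spark.implicits._

// Default parallelism for local[N] is N (e.g. 12 on a 12-core laptop).
println(spark.sparkContext.defaultParallelism)

// A DataFrame built from a 10-element local collection is capped at 10 partitions,
// even when defaultParallelism is higher, matching the note above.
val numberDF = (1 to 10).toDF("number")
println(numberDF.rdd.getNumPartitions)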
Keeping all these pointers in mind, I would suggest you try it on your own.
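For trying it on your own, here is a minimal sketch (not from the original answer) that prints which values end up in which partition using the standard RDD method mapPartitionsWithIndex, instead of writing CSV files and reading them back; it assumes the numberDF from the example above:

// Tag every row with the index of the partition it lives in, then print.
numberDF.rdd
  .mapPartitionsWithIndex { (idx, rows) => rows.map(row => (idx, row.getInt(0))) }
  .collect()
  .foreach { case (idx, value) => println(s"Partition $idx -> $value") }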