Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame Javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (For JavaRDD there's a getNumPartitions() method.)

955

asked Feb 11 '17 02:02

kecso

2 Answers

You need to call getNumPartitions() on the DataFrame's underlying RDD, e.g., df.rdd.getNumPartitions(). In the case of Scala, this is a parameterless method: df.rdd.getNumPartitions.
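As a minimal sketch of the call above, assuming Spark 2.x or later (SparkSession) and an arbitrary local[2] master; the object name PartitionCount and the sample data are only illustrative:

```scala
import org.apache.spark.sql.SparkSession

object PartitionCount {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads, chosen only for this sketch
    val spark = SparkSession.builder()
      .appName("PartitionCount")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative DataFrame built from a local list
    val df = (1 to 10).toList.toDF("number")

    // Number of partitions of the DataFrame's underlying RDD
    println(df.rdd.getNumPartitions)

    // The partitions array reports the same count
    println(df.rdd.partitions.length)

    spark.stop()
  }
}
```

The printed value depends on the master setting and on the size of the data, so it may differ on your setup.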

117

answered Nov 03 '22 08:11

user4601931

dataframe.rdd.partitions.size is another alternative, apart from df.rdd.getNumPartitions() or df.rdd.partitions.length.

Let me explain this with a full example.

val x = (1 to 10).toList
val numberDF = x.toDF("number")
numberDF.rdd.partitions.size // => 4

To confirm how many partitions we got above, save the DataFrame as CSV:

numberDF.write.csv("/Users/Ram.Ghadiyaram/output/numbers")

Here is how the data is split across the partitions:

Partition 00000: 1, 2
Partition 00001: 3, 4, 5
Partition 00002: 6, 7
Partition 00003: 8, 9, 10
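If you would rather not write files just to look at the layout, one possible alternative is to tag each row with its partition ID. This is only a sketch, assuming Spark 2.x or later and the same local[4] master; the object name PartitionContents and the column name "partition" are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

object PartitionContents {
  def main(args: Array[String]): Unit = {
    // Assumption: local[4], matching the example in this answer
    val spark = SparkSession.builder()
      .appName("PartitionContents")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    val numberDF = (1 to 10).toList.toDF("number")

    // Add a column holding the ID of the partition each row lives in
    val tagged = numberDF.withColumn("partition", spark_partition_id())
    tagged.show()

    // Count the rows held by each partition
    tagged.groupBy("partition").count().orderBy("partition").show()

    spark.stop()
  }
}
```

The partition IDs here should line up with the part-0000N files that the CSV write produces.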

Update:

@Hemanth asked a good question in the comments: why is the number of partitions 4 in the case above?

Short answer: it depends on where you run it. Since I used local[4], I got 4 partitions.

Long answer:

I ran the program above on my local machine with the master set to local[4], so it used 4 partitions.

val spark = SparkSession.builder()
  .appName(this.getClass.getName)
  .config("spark.master", "local[4]")
  .getOrCreate()

In spark-shell with master yarn, I got 2 partitions.

Example: run spark-shell --master yarn and type the same commands again:

scala> val x = (1 to 10).toList
x: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val numberDF = x.toDF("number")
numberDF: org.apache.spark.sql.DataFrame = [number: int]

scala> numberDF.rdd.partitions.size
res0: Int = 2

Here 2 is Spark's default parallelism, which Spark uses to decide how many partitions to create. If you run in local mode with all cores, i.e. local[Runtime.getRuntime.availableProcessors()], it will try to allocate that many partitions. If you have 12 available processors and a list of 1 to 10, then only 10 partitions will be created.

NOTE:

I am running the Spark program on a 12-core laptop, where by default the number of partitions/tasks is the number of available cores, i.e. 12 (that is, local[*] or s"local[${Runtime.getRuntime.availableProcessors()}]"). In this case there are only 10 numbers, so it is limited to 10 partitions.

Keeping all these pointers in mind, I would suggest you try it on your own.
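A small sketch you could use for that experiment, assuming Spark 2.x or later and a local[*] master; the object name DefaultParallelismCheck and the row counts are only illustrative. It prints the available processors, Spark's default parallelism, and the partition counts for a small and a larger local list, so you can compare them on your own machine:

```scala
import org.apache.spark.sql.SparkSession

object DefaultParallelismCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: local[*], i.e. one worker thread per available core
    val spark = SparkSession.builder()
      .appName("DefaultParallelismCheck")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val cores = Runtime.getRuntime.availableProcessors()
    println(s"available processors: $cores")
    println(s"default parallelism:  ${spark.sparkContext.defaultParallelism}")

    // A list with fewer elements than a typical core count, and a larger one
    val small = (1 to 10).toList.toDF("number")
    val large = (1 to 1000).toList.toDF("number")

    println(s"partitions for 10 rows:   ${small.rdd.getNumPartitions}")
    println(s"partitions for 1000 rows: ${large.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Changing the master to something like local[4] and rerunning is a quick way to see how the numbers move with the setting.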

answered Nov 03 '22 09:11

Ram Ghadiyaram

Tags: python, dataframe, scala, apache-spark, apache-spark-sql
