 

Number of Partitions of Spark Dataframe

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD we can specify the number of partitions at creation time, like below.

val RDD1 = sc.textFile("path", 6)

But for a Spark DataFrame there does not seem to be an option to specify the number of partitions at creation time, the way there is for an RDD.

The only possibility I can see is to call the repartition API after the DataFrame has been created:

df.repartition(4)

So, can anyone tell me whether it is possible to specify the number of partitions while creating a DataFrame?

Ramesh asked Sep 07 '16


People also ask

How do I know the number of partitions in Spark?

In PySpark (Spark with Python) you can get the current number of partitions by calling getNumPartitions() on an RDD, so to use it with a DataFrame you first need to convert the DataFrame to an RDD.

How do I know how many partitions a DataFrame has?

Finding the number of partitions: simply convert the DataFrame to an RDD and call partitions followed by size to get the number of partitions. For a DataFrame produced by a shuffle with default settings, we would see the number of partitions as 200.
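A minimal Scala sketch of the above, assuming a DataFrame named df already exists:

df.rdd.partitions.size      // number of partitions, via the underlying RDD
df.rdd.getNumPartitions     // equivalent helper on the RDD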

What is a good number of partitions in Spark?

A common recommendation for Spark is to have about 4x as many partitions as there are cores available to the application in the cluster; as an upper bound, each task should take at least 100ms to execute.
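A rough sketch of that heuristic in Scala (sc.defaultParallelism usually reflects the cores available to the application; the 4x factor is just the rule of thumb quoted above):

val targetPartitions = sc.defaultParallelism * 4
val dfRepartitioned = df.repartition(targetPartitions)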

What is the default number of partitions in Spark?

By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.


1 Answer

You cannot, or at least not in the general case, but it is not that different from an RDD. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
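For illustration, a small sketch (the path is a placeholder): the second argument to textFile is only a minimum, so Spark may create more partitions than requested.

val rdd = sc.textFile("hdfs:///some/path", 6)   // at least 6 partitions
println(rdd.getNumPartitions)                   // may print a value >= 6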

In general:

  • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from the parent RDD.
  • Datasets created using the data source API:

    • In Spark 1.x the number of partitions typically depends on the Hadoop configuration (min / max split size).
    • In Spark 2.x there is a Spark SQL specific configuration in use.
  • Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, the value range, and the desired number of partitions (see the sketch after this list).
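A hedged sketch of those JDBC options in Scala; the URL, table and column names below are placeholders, not taken from the question:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "schema.some_table")
  .option("partitionColumn", "id")     // numeric column used to split the reads
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")        // desired number of partitions / parallel reads
  .load()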
zero323 answered Dec 18 '22