 

Number of Partitions of Spark Dataframe

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD we can specify the number of partitions at creation time, like below.

val RDD1 = sc.textFile("path", 6)

But for a Spark DataFrame there does not seem to be an option to specify the number of partitions at creation time, the way there is for an RDD.

The only possibility I can see is to call the repartition API after the DataFrame has been created:

df.repartition(4)

So, can anyone tell me whether it is possible to specify the number of partitions while creating a DataFrame?

Ramesh asked Sep 07 '16


People also ask

How do I know the number of partitions in Spark?

In PySpark (Spark with Python) you can get the current number of partitions by calling getNumPartitions() on an RDD, so to use it with a DataFrame you first need to convert the DataFrame to an RDD.

How do I know how many partitions a DataFrame has?

Finding the number of partitions: simply convert the DataFrame to an RDD and call partitions followed by size to get the number of partitions. For a DataFrame produced by a shuffle with default settings, we would see the number of partitions as 200.
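A minimal Scala sketch of the above, assuming a DataFrame named df already exists:

df.rdd.partitions.size      // number of partitions, via the underlying RDD
df.rdd.getNumPartitions     // equivalent helper on the RDD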

What is a good number of partitions in Spark?

A common recommendation for Spark is to have about 4x as many partitions as there are cores available to the application in the cluster; as an upper bound, each task should take at least 100ms to execute.
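A rough sketch of that heuristic in Scala (sc.defaultParallelism usually reflects the cores available to the application; the 4x factor is just the rule of thumb quoted above):

val targetPartitions = sc.defaultParallelism * 4
val dfRepartitioned = df.repartition(targetPartitions)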

What is the default number of partitions in Spark?

By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.


1 Answer

You cannot, or at least not in the general case, but it is not that different from an RDD. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
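For illustration, a small sketch (the path is a placeholder): the second argument to textFile is only a minimum, so Spark may create more partitions than requested.

val rdd = sc.textFile("hdfs:///some/path", 6)   // at least 6 partitions
println(rdd.getNumPartitions)                   // may print a value >= 6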

In general:

  • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from the parent RDD.
  • Datasets created using the data source API:

    • In Spark 1.x the number of partitions typically depends on the Hadoop configuration (min / max split size).
    • In Spark 2.x there is a Spark SQL specific configuration in use.
  • Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, the value range, and the desired number of partitions (see the sketch after this list).
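A hedged sketch of those JDBC options in Scala; the URL, table and column names below are placeholders, not taken from the question:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "schema.some_table")
  .option("partitionColumn", "id")     // numeric column used to split the reads
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")        // desired number of partitions / parallel reads
  .load()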
zero323 answered Dec 18 '22