Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)

I'm confused about the difference in behaviour of the numPartitions parameter in the following methods:

DataFrameReader.jdbc
Dataset.repartition

The official docs of DataFrameReader.jdbc say the following about the numPartitions parameter:

numPartitions: the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly.

And the official docs of Dataset.repartition say:

Returns a new Dataset that has exactly numPartitions partitions.

My current understanding:

The numPartitions parameter of the DataFrameReader.jdbc method controls the degree of parallelism in reading the data from the database.
The numPartitions parameter of Dataset.repartition controls the number of output files generated when this DataFrame is written to disk.

My questions:

If I read a DataFrame via DataFrameReader.jdbc and then write it to disk (without invoking the repartition method), would there still be as many files in the output as there would have been had I written out the DataFrame after invoking repartition on it?
If the answer to the above question is:
- Yes: Then is it redundant to invoke the repartition method on a DataFrame that was read using the DataFrameReader.jdbc method (with the numPartitions parameter)?
- No: Then please correct the gaps in my understanding. Also, in that case, shouldn't the numPartitions parameter of the DataFrameReader.jdbc method be called something like 'parallelism'?

asked Jan 16 '18 07:01

y2k-shubham

1 Answer

Short answer: There is (almost) no difference in the behaviour of the numPartitions parameter in the two methods.

read.jdbc(..numPartitions..)

Here, the numPartitions parameter controls:

The number of parallel connections made to MySQL (or any other RDBMS) for reading the data into the DataFrame.
The degree of parallelism of all subsequent operations on the read DataFrame, including writing to disk, until the repartition method is invoked on it.
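To make the first point concrete, here is a minimal sketch in plain Python of how a column range can be split into one WHERE-clause predicate per partition. It is a simplified illustration, not Spark's actual implementation: the function name stride_predicates and the example column id are hypothetical, and integer bounds are assumed.

```python
def stride_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE-clause predicate per partition (simplified sketch).

    Assumes integer bounds. The bounds only decide the stride; rows outside
    [lower_bound, upper_bound) still land in the first or last partition.
    """
    span = upper_bound - lower_bound
    if span <= 0:
        raise ValueError("upper_bound must be greater than lower_bound")

    # Never create more partitions than there are distinct integer values.
    n = min(num_partitions, span)
    if n < 2:
        return [None]  # a single partition needs no predicate

    stride = span // n
    # Internal boundaries between consecutive partitions.
    boundaries = [lower_bound + k * stride for k in range(1, n)]

    predicates = [f"{column} < {boundaries[0]} OR {column} IS NULL"]
    for low, high in zip(boundaries, boundaries[1:]):
        predicates.append(f"{column} >= {low} AND {column} < {high}")
    predicates.append(f"{column} >= {boundaries[-1]}")
    return predicates


if __name__ == "__main__":
    for predicate in stride_predicates("id", 0, 100, 4):
        print(predicate)
```

Each predicate corresponds to one query issued over its own connection, which is why numPartitions also bounds the number of concurrent connections to the database.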

repartition(..numPartitions..)

Here, the numPartitions parameter controls the degree of parallelism exhibited by any operation on the DataFrame, including writing to disk.

So basically, the DataFrame obtained by reading a MySQL table using the spark.read.jdbc(..numPartitions..) method behaves the same (exhibits the same degree of parallelism in operations performed on it) as if it had been read without parallelism and the repartition(..numPartitions..) method had been invoked on it afterwards (with the same value of numPartitions).
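One way to check this yourself is to compare the partition counts of the two DataFrames. The sketch below assumes an existing SparkSession passed in as spark and a reachable database; the helper names, the URL and the table name are hypothetical, while the option keys (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are those of Spark's JDBC data source.

```python
def jdbc_partition_options(url, table, column, lower, upper, num_partitions):
    """Options for a partitioned JDBC read (values are passed as strings)."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def compare_partition_counts(spark, options):
    """Return (partitions after a parallel read, partitions after repartition).

    `spark` is assumed to be an existing SparkSession with a JDBC driver
    for the target database on its classpath.
    """
    # Parallel read: Spark issues one query per generated partition.
    parallel_df = spark.read.format("jdbc").options(**options).load()

    # Single-connection read followed by an explicit repartition.
    serial_options = {key: options[key] for key in ("url", "dbtable")}
    serial_df = (
        spark.read.format("jdbc")
        .options(**serial_options)
        .load()
        .repartition(int(options["numPartitions"]))
    )

    return parallel_df.rdd.getNumPartitions(), serial_df.rdd.getNumPartitions()


# Hypothetical usage, given a SparkSession named `spark`:
# opts = jdbc_partition_options("jdbc:mysql://localhost:3306/shop", "orders", "id", 0, 1000000, 8)
# print(compare_partition_counts(spark, opts))
```

Both counts would normally match the requested numPartitions, although the parallel read can end up with fewer partitions when the range between lowerBound and upperBound is smaller than numPartitions.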

To answer the exact questions:

If I read a DataFrame via DataFrameReader.jdbc and then write it to disk (without invoking the repartition method), would there still be as many files in the output as there would have been had I written out the DataFrame after invoking repartition on it?

Yes

Assuming that the read was parallelized by providing the appropriate parameters (columnName, lowerBound, upperBound and numPartitions), all operations on the resulting DataFrame, including the write, will be performed in parallel. Quoting the official docs here:

numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
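The rule in the quoted passage can be sketched as follows. The first helper just restates the limit as arithmetic; the second shows where the option would be passed on a JDBC write. Both function names are hypothetical, and df is assumed to be an existing DataFrame.

```python
def effective_jdbc_write_partitions(current_partitions, num_partitions_option):
    """Partitions used for a JDBC write under the quoted rule.

    If the DataFrame has more partitions than the limit, it is coalesced
    down to the limit; otherwise it is written as-is.
    """
    return min(current_partitions, num_partitions_option)


def write_with_connection_limit(df, url, table, num_partitions):
    """Append `df` to a JDBC table using at most `num_partitions` connections.

    `df` is assumed to be an existing Spark DataFrame; `url` and `table`
    are placeholders for a real JDBC URL and table name.
    """
    (
        df.write.format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("numPartitions", str(num_partitions))
        .mode("append")
        .save()
    )


if __name__ == "__main__":
    # A DataFrame with 8 partitions and a limit of 4 is written with 4.
    print(effective_jdbc_write_partitions(8, 4))
```

Note that the quoted option belongs to the JDBC data source, so the limit it describes applies when the write target is a JDBC table.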

Yes: Then is it redundant to invoke the repartition method on a DataFrame that was read using the DataFrameReader.jdbc method (with the numPartitions parameter)?

Yes

Unless you invoke one of the other variants of the repartition method (the ones that take a columnExprs parameter), invoking repartition with the same numPartitions on such a DataFrame is redundant. It is also not free: repartition(numPartitions) always performs a full shuffle of the data among executors, even when the DataFrame already has that many partitions.

answered Nov 23 '22 08:11

y2k-shubham


Tags: dataframe, apache-spark, spark-dataframe, spark-jdbc
