I am new to Spark and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc.
I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions.
Let's say I'm going to have 20 executors, so I set my numPartitions to 20.
My partitionColumn is an auto-incremented ID field, and let's say the values range from 1 to 2,000,000.
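Roughly, the read I'm describing looks like this (the URL, table name, and credentials are just placeholders; only the partitioning options matter here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

    # Placeholder connection details and table name -- illustrative only.
    df = spark.read.jdbc(
        url="jdbc:postgresql://db-host:5432/mydb",
        table="events",
        column="id",            # partitionColumn
        lowerBound=1,
        upperBound=2_000_000,
        numPartitions=20,
        properties={"user": "user", "password": "secret",
                    "driver": "org.postgresql.Driver"},
    )
    # Spark splits [lowerBound, upperBound) into numPartitions strides of roughly 100,000 ids,
    # e.g. id < 100001, 100001 <= id < 200001, ..., id >= 1900001 (one query per partition).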
However, the user has chosen to process some very old data along with some very new data, with nothing in between, so most of the rows have ID values either under 100,000 or over 1,900,000.
Will my 1st and 20th executors get most of the work, while the other 18 executors sit there mostly idle?
If so, is there a way to prevent this?
Spark has several partitioning methods for achieving parallelism, and you should choose one based on your need:

- repartition() returns a new DataFrame partitioned by the given partitioning expressions (or simply a partition count); the resulting DataFrame is hash partitioned. Use it when you want to increase the number of partitions.
- coalesce() should be used only to reduce the number of partitions.
- repartitionByRange() uses range partitioning and is ideal for numeric columns.
- partitionBy() is a method on DataFrameWriter (it controls output layout); the others are DataFrame methods.

By default, Spark/PySpark creates partitions equal to the number of CPU cores in the machine, the data of each partition resides on a single machine, and Spark/PySpark creates a task for each partition. If you don't provide a specific partition key (a column, in the case of a DataFrame), each record is still associated with a key, producing (K, V) pairs, and the destination partition is chosen by HashPartitioner, Spark's default partitioner, which hashes the key and takes it modulo the number of partitions.
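A quick sketch of those methods in PySpark (the column and output path below are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("bucket", col("id") % 10)   # toy DataFrame

    df_hash  = df.repartition(40)               # hash partitioning; use to increase partitions
    df_range = df.repartitionByRange(40, "id")  # range partitioning; ideal for numeric columns
    df_small = df.coalesce(10)                  # only ever reduces the number of partitions

    # partitionBy() lives on DataFrameWriter and controls output directory layout, not parallelism.
    df.write.mode("overwrite").partitionBy("bucket").parquet("/tmp/spark_partition_demo")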
I found a way to manually specify the partition boundaries: use the jdbc constructor that takes a predicates parameter.
It lets you explicitly supply one condition per partition, each of which is inserted into the WHERE clause of that partition's query, so you control exactly which range of rows each partition receives. If you don't have a uniformly distributed column to auto-partition on, you can define your own partitioning strategy this way.
An example of how to use it can be found in the accepted answer to this question.
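To make that concrete for the skewed ID ranges described in the question, here is a rough sketch (the connection details are placeholders, and the boundaries are just one way to balance roughly 20 partitions). With this form you pass predicates instead of the column/bounds options, and Spark creates one partition per predicate:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One WHERE-clause condition per partition: pack most partitions onto the dense
    # id ranges (1..100,000 and 1,900,001..2,000,000) and give the sparse middle just one.
    predicates = (
        [f"id >= {lo} AND id < {lo + 10_000}" for lo in range(1, 100_001, 10_000)]              # 10 partitions
        + ["id >= 100001 AND id < 1900001"]                                                     # 1 partition
        + [f"id >= {lo} AND id < {lo + 11_200}" for lo in range(1_900_001, 2_000_001, 11_200)]  # 9 partitions
    )

    df = spark.read.jdbc(
        url="jdbc:postgresql://db-host:5432/mydb",   # placeholder connection details
        table="events",
        predicates=predicates,
        properties={"user": "user", "password": "secret",
                    "driver": "org.postgresql.Driver"},
    )
    # df.rdd.getNumPartitions() == len(predicates) == 20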