I am running Spark in cluster mode and reading data from an RDBMS via JDBC.
As per the Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:
partitionColumn
lowerBound
upperBound
numPartitions
These parameters are optional. What would happen if I don't specify them?
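For context on what the four parameters do: Spark splits the numeric range between lowerBound and upperBound into numPartitions strides and issues one WHERE clause per partition against the partitionColumn. Below is a simplified pure-Python sketch of that stride logic, not Spark's actual implementation (the real version, in Spark's JDBCRelation.columnPartition, handles decimals, clamping, and empty strides more carefully):

```python
def column_partition(partition_column, lower_bound, upper_bound, num_partitions):
    """Simplified sketch of how Spark turns (partitionColumn, lowerBound,
    upperBound, numPartitions) into one WHERE clause per partition.
    Note: the bounds only shape the strides -- they do NOT filter rows,
    so the first and last partitions are open-ended."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        if i == 0:
            # first partition also picks up rows below lowerBound (and NULLs)
            predicates.append(
                f"{partition_column} < {current + stride} or {partition_column} is null")
        elif i == num_partitions - 1:
            # last partition picks up everything from current upward
            predicates.append(f"{partition_column} >= {current}")
        else:
            predicates.append(
                f"{current} <= {partition_column} and {partition_column} < {current + stride}")
        current += stride
    return predicates

# e.g. splitting id over [0, 100) across 4 partitions:
for pred in column_partition("id", 0, 100, 4):
    print(pred)
```

Each string becomes the WHERE clause of one JDBC query, so four executors can read the table concurrently.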
Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD.
If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates}, Spark will use a single executor and create a single non-empty partition. All data will be processed in a single transaction, and reads will be neither distributed nor parallelized.
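As an alternative to the four numeric parameters, spark.read.jdbc also accepts a predicates argument: a list of WHERE-clause strings, one per partition, useful when the split column isn't numeric. A hedged sketch of building such a list by month (the column name sale_date and the table below are made-up examples, not from the original post):

```python
def month_predicates(date_column, year):
    """Build one WHERE-clause string per month of `year`; passing the
    resulting list as the `predicates` argument of spark.read.jdbc
    yields 12 partitions, one per month."""
    preds = []
    for month in range(1, 13):
        # December rolls over into January of the next year
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        preds.append(
            f"{date_column} >= '{year}-{month:02d}-01' "
            f"and {date_column} < '{next_year}-{next_month:02d}-01'"
        )
    return preds

# Illustrative use (url, table, and credentials are placeholders):
# df = spark.read.jdbc(url, "sales",
#                      predicates=month_predicates("sale_date", 2023),
#                      properties={"user": "...", "password": "..."})
```

With predicates supplied, Spark runs one query per list element in parallel instead of the single-partition read described above.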