Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Partitioning in spark while reading from RDBMS via JDBC

I am running spark in cluster mode and reading data from RDBMS via JDBC.

As per Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:

  • partitionColumn
  • lowerBound
  • upperBound
  • numPartitions

These are optional parameters.

What would happen if I don't specify these:

  • Only 1 worker read the whole data?
  • If it still reads parallelly, how does it partition data?
like image 452
Dev Avatar asked Mar 31 '17 22:03

Dev


People also ask

How does Spark read data from RDBMS?

When we want spark to communicate with some RDBMS, we need a compatible connector. For MySQL, you can download its connector at this link MySQL Connector. Once you download it, we have to pass jar to Spark when we create SparkSession. If this does not work for you, you can also use below method to pass connector jar.

How can we partition files in Spark?

Spark/PySpark supports partitioning in memory (RDD/DataFrame) and partitioning on the disk (File system). Partition in memory: You can partition or repartition the DataFrame by calling repartition() or coalesce() transformations.

What are the types of partitioning in Spark?

Apache Spark supports two types of partitioning “hash partitioning” and “range partitioning”.

Can Spark SQL read data from other databases?

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD.


1 Answers

If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates} Spark will use a single executor and create a single non-empty partition. All data will be processed using a single transaction and reads will be neither distributed nor parallelized.

See also:

  • How to optimize partitioning when migrating data from JDBC source?
  • How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
like image 116
zero323 Avatar answered Oct 15 '22 04:10

zero323