While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters like partitionColumn, lowerBound, upperBound, and numPartitions. I have gone through the Spark documentation but wasn't able to understand it.
Can anyone explain the meanings of these parameters?
It is simple:
partitionColumn is the column used to determine partitions.
lowerBound and upperBound determine the range of values used to compute the partition stride. They do not filter rows: values below lowerBound end up in the first partition and values above upperBound in the last one, so the complete dataset still corresponds to the following query:
SELECT * FROM table
numPartitions determines the number of partitions to be created. The range between lowerBound and upperBound is divided into numPartitions partitions, each with stride equal to:
upperBound / numPartitions - lowerBound / numPartitions
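To make the arithmetic concrete, here is a minimal Python sketch of the stride formula. The function name `partition_stride` is made up for illustration and is not part of Spark. It assumes integer, non-negative bounds, and newer Spark versions may compute the stride with more precision than this simple formula.

```python
def partition_stride(lower_bound: int, upper_bound: int, num_partitions: int) -> int:
    """Stride computed as upperBound / numPartitions - lowerBound / numPartitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Each bound is divided separately, exactly as the formula is written.
    # Assumption: bounds are non-negative, so Python's floor division (//)
    # matches JVM-style integer division, which truncates toward zero.
    return upper_bound // num_partitions - lower_bound // num_partitions


print(partition_stride(0, 1000, 10))
```

Because each bound is divided on its own, a range that does not divide evenly is rounded down: with bounds 0 and 1005 and 10 partitions the stride is still 100.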
For example, if:
lowerBound: 0
upperBound: 1000
numPartitions: 10
The stride is equal to 100 and the partitions correspond to the following queries (the ranges are half-open so that they do not overlap, and the first and last queries are open-ended):
SELECT * FROM table WHERE partitionColumn < 100 OR partitionColumn IS NULL
SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
...
SELECT * FROM table WHERE partitionColumn >= 900
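As a rough illustration, the per-partition WHERE clauses can be generated with a few lines of Python. This is only a sketch based on my understanding of how Spark's JDBC source splits the range (half-open intervals, NULLs in the first partition, open-ended first and last partitions). The helper name `partition_predicates` is hypothetical, and the exact SQL text Spark emits may differ.

```python
from typing import List, Optional


def partition_predicates(column: str, lower_bound: int, upper_bound: int,
                         num_partitions: int) -> List[Optional[str]]:
    """Return one WHERE-clause string per partition (None means no filter)."""
    if num_partitions < 2:
        # A single partition reads the whole table without a filter.
        return [None]
    # Assumption: non-negative integer bounds, so // matches JVM integer division.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates: List[Optional[str]] = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last one no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if lower is None:
            # The first partition also collects rows where the column is NULL.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for clause in partition_predicates("partitionColumn", 0, 1000, 10):
    print(f"SELECT * FROM table WHERE {clause}")
```

Since the outer partitions are unbounded, choosing bounds much narrower than the real data range does not drop rows; it just makes the first and last partitions larger than the others.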