What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?

Question

While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters like partitionColumn, lowerBound, upperBound, and numPartitions. I have gone through spark documentation but wasn't able to understand it.

Can anyone explain me the meanings of these parameters?

user7280077 · Accepted Answer

It is simple:

partitionColumn is a column which should be used to determine partitions.
lowerBound and upperBound determine range of values to be fetched. Complete dataset will use rows corresponding to the following query:
```
SELECT * FROM table WHERE partitionColumn BETWEEN lowerBound AND upperBound 
```
numPartitions determines number of partitions to be created. Range between lowerBound and upperBound is divided into numPartitions each with stride equal to:
```
upperBound / numPartitions - lowerBound / numPartitions 
```
For example if:
- lowerBound: 0
- upperBound: 1000
- numPartitions: 10
Stride is equal to 100 and partitions correspond to following queries:
- SELECT * FROM table WHERE partitionColumn BETWEEN 0 AND 100
- SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200
- ...
- SELECT * FROM table WHERE partitionColumn BETWEEN 900 AND 1000

What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?

Tags:

jdbc

apache-spark

apache-spark-sql

Bhanuday Birla

1 Answers

user7280077

Recent Activity

Donate For Us

What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?

Tags:

jdbc

apache-spark

apache-spark-sql

Bhanuday Birla

1 Answers

user7280077

Related questions

Recent Activity

Donate For Us