Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?

While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters like partitionColumn, lowerBound, upperBound, and numPartitions. I have gone through spark documentation but wasn't able to understand it.

Can anyone explain me the meanings of these parameters?

like image 875
Bhanuday Birla Avatar asked Dec 11 '16 10:12

Bhanuday Birla


1 Answers

It is simple:

  • partitionColumn is a column which should be used to determine partitions.
  • lowerBound and upperBound determine range of values to be fetched. Complete dataset will use rows corresponding to the following query:

    SELECT * FROM table WHERE partitionColumn BETWEEN lowerBound AND upperBound 
  • numPartitions determines number of partitions to be created. Range between lowerBound and upperBound is divided into numPartitions each with stride equal to:

    upperBound / numPartitions - lowerBound / numPartitions 

    For example if:

    • lowerBound: 0
    • upperBound: 1000
    • numPartitions: 10

    Stride is equal to 100 and partitions correspond to following queries:

    • SELECT * FROM table WHERE partitionColumn BETWEEN 0 AND 100
    • SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200
    • ...
    • SELECT * FROM table WHERE partitionColumn BETWEEN 900 AND 1000
like image 130
user7280077 Avatar answered Sep 20 '22 09:09

user7280077