
Spark RDD partition by key in an exclusive way

I would like to partition an RDD by key so that each partition contains only the values of a single key. For example, if I have 100 distinct key values and I repartition(102), the RDD should have 2 empty partitions and 100 partitions that each contain the values of a single key.

I tried groupByKey(k).repartition(102), but this does not guarantee that each partition holds a single key exclusively: I see some partitions containing the values of more than one key, and more than 2 empty partitions.

Is there a way in the standard API to do this?

asked Nov 19 '18 by alexlipa




2 Answers

To use partitionBy(), the RDD must consist of tuple (key, value) pairs. Let's look at an example below.

Suppose I have an input file with the following data:

OrderId|OrderItem|OrderDate|OrderPrice|ItemQuantity
1|Gas|2018-01-17|1895|1
1|Air Conditioners|2018-01-28|19000|3
1|Television|2018-01-11|45000|2
2|Gas|2018-01-17|1895|1
2|Air Conditioners|2017-01-28|19000|3
2|Gas|2016-01-17|2300|1
1|Bottle|2018-03-24|45|10
1|Cooking oil|2018-04-22|100|3
3|Inverter|2015-11-02|29000|1
3|Gas|2014-01-09|2300|1
3|Television|2018-01-17|45000|2
4|Gas|2018-01-17|2300|1
4|Television$$|2018-01-17|45000|2
5|Medicine|2016-03-14|23.50|8
5|Cough Syrup|2016-01-28|190|1
5|Ice Cream|2014-09-23|300|7
5|Pasta|2015-06-30|65|2

PATH_TO_FILE="file:///u/vikrant/OrderInputFile"

Read the file into an RDD and skip the header:

RDD = sc.textFile(PATH_TO_FILE)
header = RDD.first()                       # first line is the header
newRDD = RDD.filter(lambda x: x != header)

Now let's repartition this RDD into 5 partitions:

partitionRDD = newRDD.repartition(5)

Let's have a look at how the data is distributed across these 5 partitions:

print("Partitions structure: {}".format(partitionRDD.glom().collect()))

Here you can see that the data is written into only two partitions, three of them are empty, and it is not distributed uniformly:

Partitions structure: [[], 
[u'1|Gas|2018-01-17|1895|1', u'1|Air Conditioners|2018-01-28|19000|3', u'1|Television|2018-01-11|45000|2', u'2|Gas|2018-01-17|1895|1', u'2|Air Conditioners|2017-01-28|19000|3', u'2|Gas|2016-01-17|2300|1', u'1|Bottle|2018-03-24|45|10', u'1|Cooking oil|2018-04-22|100|3', u'3|Inverter|2015-11-02|29000|1', u'3|Gas|2014-01-09|2300|1'], 
[u'3|Television|2018-01-17|45000|2', u'4|Gas|2018-01-17|2300|1', u'4|Television$$|2018-01-17|45000|2', u'5|Medicine|2016-03-14|23.50|8', u'5|Cough Syrup|2016-01-28|190|1', u'5|Ice Cream|2014-09-23|300|7', u'5|Pasta|2015-06-30|65|2'], 
[], []]

We need to create a pair RDD in order to have the data distributed across the partitions by key. Let's break each line into a (key, value) pair:

pairRDD = newRDD.map(lambda x: (x[0], x[1:]))   # key = first character (the OrderId), value = rest of the line

Now let's partition this pair RDD into 5 partitions with partitionBy, using the key (the OrderId at position [0]) to decide the target partition:

newpairRDD = pairRDD.partitionBy(5, lambda k: int(k[0]))   # the partition function must return an int; Spark takes it modulo 5, so key '5' lands in partition 0

Now we can see that each partition contains only the records of a single key:

print("Partitions structure: {}".format(newpairRDD.glom().collect()))
Partitions structure: [
[(u'5', u'|Medicine|2016-03-14|23.50|8'), 
(u'5', u'|Cough Syrup|2016-01-28|190|1'), 
(u'5', u'|Ice Cream|2014-09-23|300|7'), 
(u'5', u'|Pasta|2015-06-30|65|2')],

[(u'1', u'|Gas|2018-01-17|1895|1'), 
(u'1', u'|Air Conditioners|2018-01-28|19000|3'), 
(u'1', u'|Television|2018-01-11|45000|2'), 
(u'1', u'|Bottle|2018-03-24|45|10'), 
(u'1', u'|Cooking oil|2018-04-22|100|3')], 

[(u'2', u'|Gas|2018-01-17|1895|1'), 
(u'2', u'|Air Conditioners|2017-01-28|19000|3'), 
(u'2', u'|Gas|2016-01-17|2300|1')], 

[(u'3', u'|Inverter|2015-11-02|29000|1'), 
(u'3', u'|Gas|2014-01-09|2300|1'), 
(u'3', u'|Television|2018-01-17|45000|2')], 

[(u'4', u'|Gas|2018-01-17|2300|1'), 
(u'4', u'|Television$$|2018-01-17|45000|2')]
]

Below you can verify the number of records in each partition:

partitionSizes = newpairRDD.glom().map(len).collect()
print(partitionSizes)

[4, 5, 3, 3, 2]
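
As an alternative check (a sketch that is not part of the original answer, assuming an active SparkSession; the column names order_id, rest_of_line and partition_id are just illustrative), you can convert the pair RDD to a DataFrame, which should keep its partitioning, and count rows per partition with spark_partition_id:

from pyspark.sql.functions import spark_partition_id

# Convert the pair RDD to a DataFrame (illustrative column names)
df = newpairRDD.toDF(["order_id", "rest_of_line"])

# Tag each row with the partition it lives in, then count rows per partition
df.withColumn("partition_id", spark_partition_id()) \
  .groupBy("partition_id") \
  .count() \
  .show()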

Please note that the partition function you pass to partitionBy must return an int (here we cast the key with int(...)); otherwise you will get an error.
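
If your keys are not numeric, one possible approach (a sketch, not part of the original answer) is to collect the distinct keys and map each one to its own partition index:

distinct_keys = pairRDD.keys().distinct().collect()        # assumes the key set is small enough to collect to the driver
key_to_index = {k: i for i, k in enumerate(distinct_keys)}

# One partition per distinct key; the lambda sends every key to its own partition
exclusiveRDD = pairRDD.partitionBy(len(distinct_keys), lambda k: key_to_index[k])

print(exclusiveRDD.glom().map(len).collect())              # each partition now holds exactly one key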

Hope this helps!

answered Oct 12 '22 by vikrant rana


For an RDD, have you tried using partitionBy to partition the RDD by key, like in this question? You can specify the number of partitions to be the number of keys to get rid of the empty partitions if desired.
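
For instance, here is a minimal sketch of the situation described in the question (100 integer keys spread over 102 partitions; the data is made up for illustration):

rdd = sc.parallelize([(k, v) for k in range(100) for v in range(3)])

# With an identity partition function, key k goes straight to partition k,
# so the first 100 partitions hold one key each and the last 2 stay empty.
partitioned = rdd.partitionBy(102, lambda k: k)

sizes = partitioned.glom().map(len).collect()
print(len(sizes), sizes.count(0))   # 102 partitions, 2 of them empty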

In the Dataset API, you can use repartition with a Column as an argument to partition by the values in that column (although note that this uses the value of spark.sql.shuffle.partitions as the number of partitions, so you'll get a lot more empty partitions).
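
A minimal sketch of that DataFrame approach (assuming an active SparkSession named spark and made-up data); passing an explicit number of partitions avoids relying on spark.sql.shuffle.partitions:

from pyspark.sql.functions import col, spark_partition_id

df = spark.createDataFrame([(k, v) for k in range(100) for v in range(3)], ["key", "value"])

# Hash-partition by the key column: all rows with the same key end up in the same
# partition, but hash collisions can still put several keys in one partition.
repartitioned = df.repartition(102, col("key"))

repartitioned.groupBy(spark_partition_id().alias("pid")).count().show()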

answered Oct 12 '22 by Jason