Custom partitioner in SPARK (pyspark)

I am trying to create a custom partitioner in a Spark job using PySpark. Say I have a list of integers [10,20,30,40,50,10,20,35]. Now I want a scenario where I have two partitions, say p1 and p2: p1 contains all the list elements below 30 and p2 contains all the elements above 30.

elements = sc.parallelize([10,20,30,40,50,10,20,35]).map(lambda x : (float(x)/10,x)).partitionBy(2).glom().collect()

The above code partitions the list according to the hash of the arbitrary key I am passing. Is there any way of partitioning the list according to a particular condition, e.g. whether the value is less than some threshold x?

asked Mar 30 '17 by User9523

1 Answer

Piggybacking off of FaigB's answer, you want to partition on whether the value is above a threshold, not on the value itself. Here's how it would look in Python:

# key each value by x/10, then send keys <= 3 to partition 0 and keys > 3 to partition 1
rdd = sc.parallelize([10,20,30,40,50,10,20,35]).map(lambda x : (float(x)/10, float(x)/10))
elements = rdd.partitionBy(2, lambda x: int(x > 3)).map(lambda x: x[0]).glom().collect()
elements

Which results in

[[1.0, 2.0, 3.0, 1.0, 2.0], [4.0, 5.0, 3.5]]
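
As a usage note, here's a minimal variant (assuming the same SparkContext sc is available) that partitions on the original integer values directly, skipping the divide-by-10 step; the partition function just has to map each key to 0 or 1:

pairs = sc.parallelize([10, 20, 30, 40, 50, 10, 20, 35]).map(lambda x: (x, x))
# keys <= 30 go to partition 0, keys > 30 go to partition 1
parts = pairs.partitionBy(2, lambda k: int(k > 30)).map(lambda kv: kv[1]).glom().collect()
parts

which should give [[10, 20, 30, 10, 20], [40, 50, 35]].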

answered Oct 05 '22 by David