I am trying to create a custom partitioner in a Spark job using PySpark. Say I have a list of integers, [10, 20, 30, 40, 50, 10, 20, 35]. Now I want two partitions, say p1 and p2, where p1 contains all the list elements below 30 and p2 contains all the elements above 30.
elements = sc.parallelize([10, 20, 30, 40, 50, 10, 20, 35]).map(lambda x: (float(x) / 10, x)).partitionBy(2).glom().collect()
The above code partitions the list according to the hash of the arbitrary key I am passing. Is there any way of partitioning the list according to a particular condition instead, e.g. whether the value is less than some threshold x?
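For simple keys the default partitioner (pyspark.rdd.portable_hash) essentially boils down to hash(key), so each record lands in partition hash(key) % numPartitions, which has nothing to do with whether the value is below or above 30. A quick plain-Python illustration of that default placement (no SparkContext needed; a simplified sketch, not the exact Spark code path):

# Approximate what the default hash partitioner does with the x/10 keys above:
# placement depends only on the hash of the key, not on its magnitude.
num_partitions = 2
for x in [10, 20, 30, 40, 50, 10, 20, 35]:
    key = float(x) / 10
    print(x, "-> partition", hash(key) % num_partitions)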
Piggybacking off of FaigB's answer: you want to partition on whether the value is above a threshold, not on the value itself. Here's how it would look in Python:
rdd = sc.parallelize([10, 20, 30, 40, 50, 10, 20, 35]).map(lambda x: (float(x) / 10, float(x) / 10))
# int(x > 3) sends keys above 3 (i.e. original values above 30) to partition 1, the rest to partition 0
elements = rdd.partitionBy(2, lambda x: int(x > 3)).map(lambda x: x[0]).glom().collect()
elements
Which results in
[[1.0, 2.0, 3.0, 1.0, 2.0], [4.0, 5.0, 3.5]]
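If you want to keep the original integers rather than the scaled keys, the same kind of partitioner works with the raw value as the key. A minimal sketch along the same lines (assuming the same sc; k > 30 keeps 30 in the first partition, matching the x > 3 cut above, and element order within a partition may vary):

# Raw integers as both key and value: everything up to 30 goes to
# partition 0, everything above 30 goes to partition 1.
rdd = sc.parallelize([10, 20, 30, 40, 50, 10, 20, 35]).map(lambda x: (x, x))
parts = rdd.partitionBy(2, lambda k: int(k > 30)).map(lambda kv: kv[1]).glom().collect()
# expected: [[10, 20, 30, 10, 20], [40, 50, 35]]

The second argument to partitionBy is just a function from key to an integer bucket, so any predicate-based split can be expressed this way.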