I am trying to create a custom partitioner in a Spark job using PySpark. Say I have a list of integers, [10, 20, 30, 40, 50, 10, 20, 35]. Now I want two partitions, say p1 and p2, where p1 contains all the list elements below 30 and p2 contains all the elements above 30.
elements = sc.parallelize([10, 20, 30, 40, 50, 10, 20, 35]).map(lambda x: (float(x) / 10, x)).partitionBy(2).glom().collect()
The above code partitions the list according to the hash of the arbitrary key I am passing. Is there any way of partitioning the list according to a particular condition instead, e.g. whether the value is less than some threshold x?
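For simple keys the default partitioner (pyspark.rdd.portable_hash) essentially boils down to hash(key), so each record lands in partition hash(key) % numPartitions, which has nothing to do with whether the value is below or above 30. A quick plain-Python illustration of that default placement (no SparkContext needed; a simplified sketch, not the exact Spark code path):

# Approximate what the default hash partitioner does with the x/10 keys above:
# placement depends only on the hash of the key, not on its magnitude.
num_partitions = 2
for x in [10, 20, 30, 40, 50, 10, 20, 35]:
    key = float(x) / 10
    print(x, "-> partition", hash(key) % num_partitions)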
Piggybacking off of FaigB's answer: you want to partition on whether the value is above a threshold, not on the value itself. Here's how it would look in Python:
rdd = sc.parallelize([10, 20, 30, 40, 50, 10, 20, 35]).map(lambda x: (float(x) / 10, float(x) / 10))
# int(x > 3) sends keys above 3 (i.e. original values above 30) to partition 1, the rest to partition 0
elements = rdd.partitionBy(2, lambda x: int(x > 3)).map(lambda x: x[0]).glom().collect()
elements
Which results in
[[1.0, 2.0, 3.0, 1.0, 2.0], [4.0, 5.0, 3.5]]
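If you want to keep the original integers rather than the scaled keys, the same kind of partitioner works with the raw value as the key. A minimal sketch along the same lines (assuming the same sc; k > 30 keeps 30 in the first partition, matching the x > 3 cut above, and element order within a partition may vary):

# Raw integers as both key and value: everything up to 30 goes to
# partition 0, everything above 30 goes to partition 1.
rdd = sc.parallelize([10, 20, 30, 40, 50, 10, 20, 35]).map(lambda x: (x, x))
parts = rdd.partitionBy(2, lambda k: int(k > 30)).map(lambda kv: kv[1]).glom().collect()
# expected: [[10, 20, 30, 10, 20], [40, 50, 35]]

The second argument to partitionBy is just a function from key to an integer bucket, so any predicate-based split can be expressed this way.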