Spark: how can I evenly distribute my records across all partitions?

Tags:

apache-spark

I have an RDD with 30 records (key/value pairs: the key is a timestamp and the value is a JPEG byte array), and I am running 30 executors. I want to repartition this RDD into 30 partitions so that every partition gets exactly one record and is assigned to one executor.

When I use rdd.repartition(30), it repartitions my RDD into 30 partitions, but some partitions get 2 records, some get 1 record, and some get none.
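
Here is roughly what I am doing (a minimal sketch; variable names are illustrative, and rdd is the pair RDD of 30 (timestamp, jpegBytes) records described above):

    // rdd: RDD[(Long, Array[Byte])] with 30 records
    val repartitioned = rdd.repartition(30)

    // Count the records in each partition: I see a mix of 0, 1 and 2
    // instead of exactly one record in each of the 30 partitions.
    println(repartitioned.glom().map(_.length).collect().mkString(", "))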

Is there any way in Spark to evenly distribute my records across all partitions?

asked Oct 30 '22 by prateek arora


1 Answer

The salting technique can be used: add a new "fake" key (a salt) and use it alongside the existing key so that records are spread more evenly across partitions.
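
A minimal sketch of the idea, assuming the records are in rdd: RDD[(Long, Array[Byte])] of (timestamp, jpegBytes) pairs and 30 partitions are wanted; the names and the salt range are illustrative:

    import org.apache.spark.HashPartitioner
    import scala.util.Random

    val numPartitions = 30

    // Attach a random salt to each key; the salted (key, salt) pairs hash
    // more evenly across partitions than the original keys alone.
    val salted = rdd.map { case (timestamp, jpegBytes) =>
      ((timestamp, Random.nextInt(numPartitions)), jpegBytes)
    }

    // Partition on the salted key.
    val spread = salted.partitionBy(new HashPartitioner(numPartitions))

    // Drop the salt once the records are laid out.
    val result = spread.map { case ((timestamp, _), jpegBytes) => (timestamp, jpegBytes) }

Note that with random salts the spread is only statistically even, so this improves the distribution rather than guaranteeing exactly one record per partition.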

answered Nov 08 '22 by devesh