 

How to define a custom partitioner for Spark RDDs so that each partition has an equal number of elements?

I am new to Spark. I have a large dataset of elements (an RDD) and I want to divide it into two exactly equal-sized partitions while maintaining the order of elements. I tried using RangePartitioner like

var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile)) 

This doesn't give a satisfactory result because it divides the data into roughly, but not exactly, equal-sized partitions. For example, with 64 elements, RangePartitioner splits them into partitions of 31 and 33 elements.

I need a partitioner such that the first 32 elements go into one half and the remaining 32 into the other. How can I use a custom partitioner to get two equally sized halves that maintain the order of elements?

asked Apr 17 '14 by yh18190

People also ask

How can you create an RDD with specific partitioning?

A loaded RDD is partitioned by the default partitioner, which uses hash codes. To specify a custom partitioner, call rdd.partitionBy() with your own partitioner.
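The hash-code partitioning mentioned here can be sketched in plain Scala, without a SparkContext. This is a simplified stand-in for what Spark's HashPartitioner does internally (a non-negative modulo of the key's hashCode), not Spark's actual source:

```scala
// Simplified sketch of hash partitioning: map a key's hashCode to a
// partition index in 0 until numPartitions, handling negative hashes.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

def hashPartition(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

// The same key always hashes to the same partition, which is what
// makes co-partitioned joins possible.
println(hashPartition("apple", 4))
println(hashPartition(-7, 4)) // negative hash codes still land in 0..3
```

Note that hash partitioning says nothing about partition *sizes*, which is why it cannot give the asker's exact 32/32 split.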

What is custom partitioning in Spark?

Custom partitioning lets you alter the size and number of partitions to suit your application's needs. You define which key goes to which partition by passing an explicit partitioner to the partitionBy method of a pair RDD.
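The "you define which key goes to which partition" idea can be illustrated with a toy partitioner. The abstract class below is a local stand-in for org.apache.spark.Partitioner (which declares the same two members) so the sketch runs without Spark on the classpath; EvenOddPartitioner is a made-up example policy, not a Spark class:

```scala
// Local stand-in for org.apache.spark.Partitioner; the real abstract
// class exposes the same numPartitions and getPartition members.
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Toy policy: even Int keys go to partition 0, odd keys to partition 1.
// The point is that the partitioner alone decides key placement.
class EvenOddPartitioner extends Partitioner {
  def numPartitions: Int = 2
  def getPartition(key: Any): Int =
    Math.floorMod(key.asInstanceOf[Int], 2) // floorMod keeps -3 -> 1, not -1
}
```

On a real pair RDD you would apply it with rdd.partitionBy(new EvenOddPartitioner).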

How many partitions does an RDD have?

By default, one partition is created for each block of the file in HDFS (64 MB in the default configuration). However, when creating an RDD, a second argument can be passed that defines the number of partitions; for example, sc.textFile(path, 5) creates an RDD with 5 partitions.
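The block-to-partition arithmetic described here is just a ceiling division, which can be checked directly (64 MB block size assumed, as in the snippet above):

```scala
// One default partition per HDFS block: ceil(fileSize / blockSize),
// computed with integer arithmetic to avoid floating point.
def defaultPartitions(fileSizeMb: Long, blockSizeMb: Long = 64): Long =
  (fileSizeMb + blockSizeMb - 1) / blockSizeMb

println(defaultPartitions(200)) // a 200 MB file spans 4 blocks of 64 MB
```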

How many ways can you create RDD in Spark?

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.


1 Answer

Partitioners work by assigning a key to a partition. You would need prior knowledge of the key distribution, or look at all keys, to make such a partitioner. This is why Spark does not provide you with one.

In general you do not need such a partitioner. In fact I cannot come up with a use case where I would need equal-size partitions. What if the number of elements is odd?

Anyway, let us say you have an RDD keyed by sequential Ints, and you know how many in total. Then you could write a custom Partitioner like this:

import org.apache.spark.Partitioner

class ExactPartitioner[V](
    partitions: Int,
    elements: Int)
  extends Partitioner {

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    // `k` is assumed to run continuously from 0 to elements - 1.
    k * partitions / elements
  }
}
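The partition arithmetic above can be verified without Spark: extracting the k * partitions / elements expression into a plain function and counting how many keys land in each partition confirms the 32/32 split for 64 elements (and a 33/32 split for an odd count):

```scala
// The getPartition arithmetic from the answer, as a plain function:
// element k of n total goes to partition k * partitions / n.
def exactPartition(k: Int, partitions: Int, elements: Int): Int =
  k * partitions / elements

// 64 elements across 2 partitions: exactly 32 in each.
val sizes64 = (0 until 64)
  .groupBy(k => exactPartition(k, 2, 64))
  .map { case (p, ks) => p -> ks.size }
println(sizes64)

// 65 elements: sizes differ by at most one (33 and 32).
val sizes65 = (0 until 65)
  .groupBy(k => exactPartition(k, 2, 65))
  .map { case (p, ks) => p -> ks.size }
println(sizes65)
```

To use it on the actual RDD, you first need sequential Int keys; one way (under the answer's assumption that you know the total count n) is rdd.zipWithIndex.map { case (v, i) => (i.toInt, v) }.partitionBy(new ExactPartitioner(2, n)) — zipWithIndex, map, and partitionBy are all real RDD methods.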
answered Sep 20 '22 by Daniel Darabos