Here is my code example:
case class Person(name: String, tel: String) {
  def equals(that: Person): Boolean = that.name == this.name && this.tel == that.tel
}
val persons = Array(Person("peter","139"),Person("peter","139"),Person("john","111"))
sc.parallelize(persons).distinct.collect
It returns
res34: Array[Person] = Array(Person(john,111), Person(peter,139), Person(peter,139))
Why doesn't distinct work? I want the result to be Person("john","111"), Person("peter","139").
distinct is a transformation, which means it is not executed immediately but only when an action is called. collect is an action: calling it causes all previous transformations to run.
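To make the transformation/action distinction concrete, here is a minimal sketch, assuming a live SparkContext named sc (e.g. inside spark-shell); the variable names are illustrative, not from the original question:

```scala
// Building the RDD and calling distinct sets up the lineage only;
// no Spark job runs at this point.
val rdd         = sc.parallelize(Seq(1, 2, 2, 3))
val distinctRdd = rdd.distinct()   // transformation: lazy, nothing executes yet

// collect is an action: it triggers execution of the whole lineage
// (parallelize -> distinct) and brings the results to the driver.
val result = distinctRdd.collect()
```

This is why the behavior of distinct is only observed at the collect call, even though the logic lives in the transformation.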
Extending further from the observation of @aaronman, there is a workaround for this issue.
On the RDD, there are two definitions of distinct:
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T] = distinct(partitions.size)
It's apparent from the signature of the first distinct that an implicit ordering of the elements is expected, and it is assumed null if absent, which is what the short version .distinct() does.
There's no default implicit ordering for case classes, but it's easy to implement one:
case class Person(name: String, tel: String) extends Ordered[Person] {
  def compare(that: Person): Int = this.name compare that.name
}
Now, trying the same example delivers the expected results (note that I'm comparing names):
val persons = Array(Person("peter","139"),Person("peter","139"),Person("john","111"))
sc.parallelize(persons).distinct.collect
res: Array[Person] = Array(Person(john,111), Person(peter,139))
Note that case classes already implement equals and hashCode, so the implementation in the provided example is unnecessary and also incorrect. The correct signature for equals is: equals(arg0: Any): Boolean
-- BTW, I first thought that the issue had to do with the incorrect equals signature, which sent me looking down the wrong path.
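For illustration, here is what a correct manual override looks like, sketched on a hypothetical plain class (case classes generate equivalent code automatically, so you would not write this for the Person case class itself). The key point is that the parameter type must be Any, otherwise you merely overload equals instead of overriding it:

```scala
// Hypothetical plain class used only to show the correct override signature.
class PersonLike(val name: String, val tel: String) {
  // Overrides Any.equals: parameter type is Any, with a pattern match.
  override def equals(other: Any): Boolean = other match {
    case p: PersonLike => p.name == name && p.tel == tel
    case _             => false
  }

  // equals and hashCode must be consistent: equal objects need equal hashes,
  // otherwise hash-based operations (like Spark's reduceByKey) break.
  override def hashCode: Int = (name, tel).##
}

val a = new PersonLike("peter", "139")
val b = new PersonLike("peter", "139")
println(a == b)                    // dispatches through equals(Any)
println(a.hashCode == b.hashCode)
```

A def equals(that: Person) like the one in the question compiles fine but is never called by generic code such as collections or Spark, which always goes through equals(Any).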