How to sort RDD

Tags: sorting, scala, apache-spark, rdd

I have scoreTriplets, an RDD[Array[String]], which I am currently sorting in the following way.


val scoreTripletsArray = scoreTriplets.collect()
if (scoreTripletsArray.nonEmpty) {
  // Sort the collected array in place, descending by the score field (index 3)
  scala.util.Sorting.stableSort(scoreTripletsArray, (e1: Array[String], e2: Array[String]) => e1(3).toInt > e2(3).toInt)
}

But collect() will be heavy if there are lakhs (hundreds of thousands) of elements.

So I need to sort the RDD by score without using collect().
scoreTriplets is an RDD[Array[String]]; each element of the RDD is an array holding the following fields, in this order:
EdgeId sourceID destID score sourceNAme destNAme distance

Please give me any reference or hint.


asked Nov 18 '15 08:11

Sandip Armal Patil

2 Answers

Because of shuffling, sorting will be an expensive operation even without collecting, but you can use the sortBy method:


import scala.util.Random

val data = Seq.fill(10)(Array.fill(3)("") :+ Random.nextInt.toString)
val rdd  = sc.parallelize(data)

val sorted = rdd.sortBy(_.apply(3).toInt)
sorted.take(3)
// Array[Array[String]] = Array(
//   Array("", "", "", -1660860558),
//   Array("", "", "", -1643214719),
//   Array("", "", "", -1206834289))
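
The question sorts in descending order, while sortBy sorts ascending by default. The following is a minimal sketch of a descending sort, assuming a local SparkSession and made-up sample rows in the field order given in the question. The names spark, rows, byScoreDesc and topEdgeIds are illustrative and not taken from the original code.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[*]").appName("sort-rdd-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample rows: EdgeId, sourceID, destID, score, sourceNAme, destNAme, distance
val rows = Seq(
  Array("e1", "1", "2", "10", "a", "b", "0.5"),
  Array("e2", "2", "3", "42", "b", "c", "1.5"),
  Array("e3", "3", "1", "7", "c", "a", "2.0")
)
val scoreTriplets = sc.parallelize(rows)

// sortBy is ascending by default; ascending = false puts the highest score first.
val byScoreDesc = scoreTriplets.sortBy(_.apply(3).toInt, ascending = false)

// Bring only a small prefix back to the driver to inspect the result.
val topEdgeIds = byScoreDesc.take(2).map(_.apply(0))
```

The sorted result stays distributed as an RDD, so nothing is collected unless an action such as take is called on it.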

If you're interested only in the top results, then top and takeOrdered are usually preferred, since they avoid sorting the whole RDD.


import scala.math.Ordering

rdd.takeOrdered(2)(Ordering.by[Array[String], Int](_.apply(3).toInt))
// Array[Array[String]] = 
//   Array(Array("", "", "", -1660860558), Array("", "", "", -1643214719))

rdd.top(2)(Ordering.by[Array[String], Int](_.apply(3).toInt))
// Array[Array[String]] = 
//   Array(Array("", "", "", 1920955686), Array("", "", "", 1597012602))
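
For the "highest scores first" case in the question, top already returns its results in descending order under the given ordering. As a rough sketch with assumed sample data (the names byScore, highest and highestViaTakeOrdered are hypothetical), the same rows can also be obtained from takeOrdered by reversing the ordering:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[*]").appName("top-n-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical rows; only index 0 (EdgeId) and index 3 (score) matter here.
val rdd = sc.parallelize(Seq(
  Array("e1", "1", "2", "10"),
  Array("e2", "2", "3", "42"),
  Array("e3", "3", "1", "7"),
  Array("e4", "1", "3", "25")
))

// Order rows by their numeric score.
val byScore = Ordering.by[Array[String], Int](_.apply(3).toInt)

// top(n) returns the n largest elements, largest first.
val highest = rdd.top(2)(byScore).map(_.apply(0))

// takeOrdered(n) returns the n smallest; with the ordering reversed, these are the n largest.
val highestViaTakeOrdered = rdd.takeOrdered(2)(byScore.reverse).map(_.apply(0))
```

Both calls return a local Array on the driver, so they are best suited to small values of n.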

answered Oct 18 '22 04:10

zero323

RDD has a sortBy method (see doc). You can do something like this:


scoreTriplets.sortBy( _(3).toInt )

answered Oct 18 '22 05:10

ponkin
