Ordering an RDD[String]

Consider

val animals = List("penguin","ferret","cat").toSeq
val rdd = sc.makeRDD(animals, 1) 

I would like to order this RDD. I'm new to Scala and a little confused about how this is to be done.

Chris asked May 29 '15 21:05



1 Answer

The RDD API documentation covers this. Look at sortBy:

sortBy[K](
  f: (T) ⇒ K, 
  ascending: Boolean = true, 
  numPartitions: Int = this.partitions.size
)

K is the type of the key you are sorting by, that is, the value f extracts from each element of the RDD. f is a function, which you can either define elsewhere with def and pass by name, or write inline as an anonymous function (which is more Scala-like). ascending controls the sort direction, and numPartitions sets how many partitions the resulting RDD has.
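To make the two ways of supplying f concrete, here is a minimal sketch. It assumes spark-core is on the classpath and creates its own throwaway local SparkContext (in spark-shell, sc already exists and that setup is unnecessary). The names SortByKeyFunction, nameLength and run are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByKeyFunction {
  // Hypothetical key function defined with def: the length of the animal's name.
  def nameLength(animal: String): Int = animal.length

  // Sorts the same RDD twice, once per style, and returns both results.
  def run(): (List[String], List[String]) = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("sortBy-key-function")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(Seq("penguin", "ferret", "cat"), 1)

      // Style 1: pass the def by name. K is Int here.
      val byDef = rdd.sortBy[Int](nameLength).collect().toList

      // Style 2: the same key written as an anonymous function.
      val byLambda = rdd.sortBy[Int](animal => animal.length).collect().toList

      (byDef, byLambda)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (byDef, byLambda) = run()
    println(byDef)    // List(cat, ferret, penguin)
    println(byLambda) // List(cat, ferret, penguin)
  }
}
```

Both calls sort by the same key, so they should produce the same ordering; which style to use is mostly a matter of whether the key function is reused elsewhere.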

So given all this, try:

rdd.sortBy[String]({animal => animal})

Then try this:

rdd.sortBy[String]({animal => animal}, false)

And then this one, which sorts the RDD by the number of letters "e" in the name of the animal, from most to least:

rdd.sortBy[Int](a => a.count(_ == 'e'), ascending = false)

Note that the original rdd isn't sorted; sortBy returns a new, sorted RDD and leaves the original unchanged.
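A small sketch of that last point, under the same assumptions as the question (a single-partition RDD of three animals) plus a local SparkContext created just for the example. The names SortByReturnsNewRdd and originalAndSorted are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByReturnsNewRdd {
  // Returns the contents of the original RDD and of the sorted RDD, in that order.
  def originalAndSorted(): (List[String], List[String]) = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("sortBy-new-rdd")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(Seq("penguin", "ferret", "cat"), 1)

      // sortBy is a transformation: it describes a new RDD and does not modify rdd.
      val sorted = rdd.sortBy[String](animal => animal)

      // collect() brings the elements back to the driver so they can be inspected.
      (rdd.collect().toList, sorted.collect().toList)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (original, sorted) = originalAndSorted()
    println(original) // List(penguin, ferret, cat)
    println(sorted)   // List(cat, ferret, penguin)
  }
}
```

To keep working with the sorted data, assign the result of sortBy to a new val, as done with sorted above.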

David Griffin answered Oct 22 '22 03:10