 

How to sort data in Spark Streaming

I am new to Spark, and I am trying to write some example code based on Spark and Spark Streaming.

So far, I have implemented a sorting function in Spark core; here is the code:

  import org.apache.spark.{SparkConf, SparkContext}

  def sort(listSize: Int, slice: Int): Unit = {
    val conf = new SparkConf().setAppName(getClass.getName)
    val spark = new SparkContext(conf)
    val data = genRandom(listSize)                 // random input list
    val distData = spark.parallelize(data, slice)  // distribute over `slice` partitions
    val result = distData.sortBy(x => x, ascending = true)
    val finalResult = result.collect()
    val step5 = System.currentTimeMillis()         // timing marker (currently unused)
    printlnArray(finalResult, 0, 10)               // print the first 10 sorted values
    spark.stop()
  }

  /**
   * Generate a list of random numbers.
   * @return a list of `listSize` random Ints in [0, 100000)
   */
  import scala.collection.mutable.ListBuffer
  import scala.util.Random

  def genRandom(listSize: Int): List[Int] = {
    val range = 100000
    val listBuffer = new ListBuffer[Int]
    val random = new Random()
    for (i <- 1 to listSize) listBuffer += random.nextInt(range)
    listBuffer.toList
  }

  def printlnArray(list: Array[Int], start: Int, offset: Int): Unit = {
    for (i <- start until start + offset) println(">>>>>>>>> list : " + i + " | " + list(i))
  }

I am having trouble implementing a sort function in Spark Streaming. As far as I know, Spark core RDDs provide a sort API, but there is no such API in Spark Streaming. Does anyone know how to do it? Thanks.

This may be a dumb question, but after searching the web I did not find a right answer. If anyone knows how to solve it, thanks for your help.

asked Jan 06 '15 by Chan

People also ask

What is sorting in Spark?

Both the sort() and orderBy() functions can be used to sort a Spark DataFrame on one or more columns, in either ascending or descending order. In the DataFrame API, orderBy() is an alias for sort(), so both produce a totally ordered result; sorting the data within each partition only, where the global output order is not guaranteed, is done with sortWithinPartitions().
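
For illustration, here is a minimal Scala sketch of these DataFrame sorting calls; the SparkSession, the column names, and the sample data are assumptions made for the example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("SortExample").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 34), ("bob", 28), ("carol", 41)).toDF("name", "age")

df.sort(col("age")).show()                  // ascending by age
df.orderBy(col("age").desc).show()          // descending by age (orderBy is an alias for sort)
df.sortWithinPartitions(col("age")).show()  // partition-local sort, no global order guarantee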

How does Spark read streaming data?

Use spark.readStream.format("socket") on the SparkSession to read data from a socket, providing the host and port options for the source you want to stream data from.
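
As an illustration, a minimal Scala sketch of a Structured Streaming socket source; the host and port (localhost:9999) and the console sink are assumptions made for the example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SocketRead").getOrCreate()

// Read lines from a socket (e.g. started with: nc -lk 9999)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Print each micro-batch to the console
val query = lines.writeStream
  .format("console")
  .start()

query.awaitTermination()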

How do I sort RDD in Spark in descending order?

By default, sortByKey() sorts elements in ascending order; to sort in descending order, pass ascending = false (in PySpark, sortByKey(ascending=False)). You can also supply a custom key function, e.g. keyfunc=lambda k: -k in PySpark, to invert the order of numeric keys.
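
In Scala, the equivalent is the ascending flag on sortByKey. A minimal sketch, assuming an existing SparkContext named sc and made-up sample pairs:

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

val asc  = pairs.sortByKey()                   // ascending (default)
val desc = pairs.sortByKey(ascending = false)  // descending

desc.collect().foreach(println)                // (c,3), (b,2), (a,1)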

What is the difference between orderBy and sort by in Spark?

The SORT BY clause returns the result rows sorted within each partition, in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered. This differs from the ORDER BY clause, which guarantees a total order of the output.
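
A minimal Scala/Spark SQL sketch contrasting the two clauses; the temporary view name and the repartitioning are assumptions made only to expose the partition-local behavior:

// Spread the data over several partitions so the difference is visible
spark.range(0, 100).repartition(4).toDF("id").createOrReplaceTempView("numbers")

// ORDER BY: total order across the whole result
spark.sql("SELECT id FROM numbers ORDER BY id DESC").show()

// SORT BY: rows are sorted within each partition only,
// so the overall output may be only partially ordered
spark.sql("SELECT id FROM numbers SORT BY id DESC").show()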


1 Answer

You can use the transform function of a DStream to apply RDD operations, such as sorting, to the underlying RDD of each batch.

For instance:

myDStream.transform(rdd => rdd.sortByKey())
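
For a plain (non key-value) stream like the one in the question, the same transform approach works with sortBy. Below is a minimal Scala sketch; the socket source on localhost:9999 and the 5-second batch interval are assumptions made for the example, not part of the original answer:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSort")
val ssc  = new StreamingContext(conf, Seconds(5))

// Parse integers from a text stream
val lines   = ssc.socketTextStream("localhost", 9999)
val numbers = lines.flatMap(_.split("\\s+")).map(_.toInt)

// Sort the RDD behind each batch; use sortByKey (as above) for (key, value) pairs
val sorted = numbers.transform(rdd => rdd.sortBy(x => x, ascending = true))

sorted.print()   // print the first elements of each sorted batch
ssc.start()
ssc.awaitTermination()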
answered Oct 08 '22 by Marco