
How to compute the top k words

I would like to compute the top k words in a Spark Streaming application, from lines of text collected in a time window.

I ended up with the following code:

...
// collect the lines received in the last 30 seconds
val window = stream.window(Seconds(30))

// split each line into words and count occurrences per word
val wc = window
  .flatMap(line => line.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

// print the 10 most frequent words of each windowed batch
wc.foreachRDD(rdd => {
  println("---------------------------------------------------")
  rdd.top(10)(Ordering.by(_._2)).zipWithIndex.foreach(println)
})
...

It seems to work.

Problem: the top k word chart is computed inside the foreachRDD function, which applies a top + print operation to each RDD returned by reduceByKey (the wc variable).

It turns out that reduceByKey returns a DStream containing a single RDD, so the code above works, but this behaviour does not seem to be guaranteed by the specs.

Am I wrong, and does it work in all circumstances?

Why is there no way, in Spark Streaming, to treat a DStream as a single RDD, instead of a collection of RDD objects, in order to execute more complex transformations?

What I mean is a function like dstream.withUnionRDD(rdd => ...) that would let you run transformations and actions on a single/union RDD. Is there an equivalent way to do such things?
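(For reference, the closest built-in mechanism I have found is DStream.transform, which exposes the RDD of each micro-batch for arbitrary RDD-level work. A minimal sketch, with top10 as an illustrative name:)

// DStream.transform exposes each micro-batch's RDD for arbitrary
// RDD-level logic; here the per-batch top 10 is re-wrapped as an RDD.
val top10 = wc.transform { rdd =>
  rdd.context.parallelize(rdd.top(10)(Ordering.by(_._2)))
}
top10.print()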

Asked Apr 09 '15 by Nicola Ferraro


1 Answer

Actually, I completely misunderstood the concept of a DStream being composed of multiple RDDs. A DStream is indeed made of multiple RDDs, but spread over time.

In the context of a single micro-batch, the DStream consists of exactly one RDD: the current one.

So, the code above always works.
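To illustrate (a minimal sketch; the log message is just for demonstration): foreachRDD fires once per batch interval, and each invocation receives the single RDD of that micro-batch:

// foreachRDD is invoked once per batch interval; rdd is the single
// RDD of that micro-batch, and time identifies the batch.
wc.foreachRDD { (rdd, time) =>
  println(s"batch at $time: ${rdd.count()} distinct words in the window")
}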

Answered Oct 30 '22 by Nicola Ferraro