I would like to compute the top-k words in a Spark Streaming application, from lines of text collected in a time window.
I ended up with the following code:
...
val window = stream.window(Seconds(30))   // collect the last 30 seconds of lines
val wc = window
  .flatMap(line => line.split(" "))       // split each line into words
  .map(w => (w, 1))                       // pair each word with a count of 1
  .reduceByKey(_ + _)                     // sum the counts per word
wc.foreachRDD(rdd => {
  println("---------------------------------------------------")
  // take the 10 words with the highest counts and print them with their rank
  rdd.top(10)(Ordering.by(_._2)).zipWithIndex.foreach(println)
})
...
It seems to work.
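For anyone who wants to reproduce it, a minimal self-contained harness could look like the sketch below; the socket source on localhost:9999, the 5-second batch interval, and the object/app names are assumptions made only to keep the example runnable, not part of my actual setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TopKWords {
  def main(args: Array[String]): Unit = {
    // local[2]: one core for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("TopKWords")
    val ssc = new StreamingContext(conf, Seconds(5))

    // assumed input source; feed it with e.g. `nc -lk 9999`
    val stream = ssc.socketTextStream("localhost", 9999)

    // same pipeline as above; the 30s window is a multiple of the batch interval
    val window = stream.window(Seconds(30))
    val wc = window
      .flatMap(line => line.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    wc.foreachRDD { rdd =>
      println("---------------------------------------------------")
      rdd.top(10)(Ordering.by(_._2)).zipWithIndex.foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}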
Problem: the top-k word chart is computed with the foreachRDD function, which runs a top-plus-print action on each RDD returned by reduceByKey (the wc variable). It turns out that reduceByKey returns a DStream with a single RDD, so the code above works, but this correct behaviour is not guaranteed by the specs. Am I wrong, and does it work in all circumstances?
Why is there not, in Spark Streaming, a way to treat a DStream as a single RDD, instead of as a collection of RDD objects, in order to execute more complex transformations? What I mean is a function like dstream.withUnionRDD(rdd => ...) that lets you run transformations and actions on a single/union RDD. Is there an equivalent way to do such things?
Actually, I had completely misunderstood the concept of a DStream being composed of multiple RDDs: a DStream is indeed made of multiple RDDs, but spread over time, one per batch interval. Within a single micro-batch, the DStream consists of exactly one RDD, the current one. So the code above always works.
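As for the second question, the closest existing operation to the imagined withUnionRDD seems to be DStream.transform, which applies an arbitrary RDD-to-RDD function to the single RDD of each micro-batch. A minimal sketch, reusing the wc stream from above (re-parallelizing the top-k array is just one way to turn the result back into an RDD and stay inside the DStream):

// transform hands us the one RDD of the current micro-batch
val topK = wc.transform { rdd =>
  rdd.sparkContext.parallelize(
    rdd.top(10)(Ordering.by[(String, Int), Int](_._2))
  )
}
topK.print()   // prints the first elements of each batch's RDD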