I am trying to understand the transform operation on a Spark DStream in Spark Streaming.
I know that transform is more powerful than map, but can someone give me a real-world or clear example that differentiates transform from map?
map is an element-wise transformation, whereas transform is an RDD-level transformation.
map(func) : Return a new DStream by passing each element of the source DStream through a function func.
Here is an example that demonstrates both the map and transform operations on a DStream:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingTransformExample")
val ssc = new StreamingContext(conf, Seconds(5))

// Build a queue of RDDs to simulate a stream of micro-batches
val rdd1 = ssc.sparkContext.parallelize(Array(1, 2, 3))
val rdd2 = ssc.sparkContext.parallelize(Array(4, 5, 6))
val rddQueue = new Queue[RDD[Int]]
rddQueue.enqueue(rdd1)
rddQueue.enqueue(rdd2)

// oneAtATime = true: consume one RDD from the queue per batch interval
val numsDStream = ssc.queueStream(rddQueue, true)

val plusOneDStream = numsDStream.map(x => x + 1)
plusOneDStream.print()
The map operation adds 1 to each element of every RDD in the DStream, giving the output shown below:
-------------------------------------------
Time: 1501135220000 ms
-------------------------------------------
2
3
4
-------------------------------------------
Time: 1501135225000 ms
-------------------------------------------
5
6
7
-------------------------------------------
transform(func) : Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to apply arbitrary RDD operations on the DStream.
val commonRdd = ssc.sparkContext.parallelize(Array(0))
val combinedDStream = numsDStream.transform(rdd => rdd.union(commonRdd))
combinedDStream.print()

ssc.start()             // start the streaming computation
ssc.awaitTermination()
transform lets you perform RDD operations such as join and union on the RDDs within the DStream; the example code above produces the output below:
-------------------------------------------
Time: 1501135490000 ms
-------------------------------------------
1
2
3
0
-------------------------------------------
Time: 1501135495000 ms
-------------------------------------------
4
5
6
0
-------------------------------------------
Time: 1501135500000 ms
-------------------------------------------
0
-------------------------------------------
Time: 1501135505000 ms
-------------------------------------------
0
-------------------------------------------
Here commonRdd, which contains the single element 0, is unioned with every RDD in the DStream. Note that once the queue is exhausted, each subsequent batch is an empty RDD, so the union yields only the element 0; that is why the later batches print just 0.
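The difference can also be illustrated without a Spark cluster. The sketch below is a hypothetical model (the names MapVsTransform, mapStream, and transformStream are invented for illustration) in which a DStream is approximated as a plain Seq of micro-batches: map's function sees one element at a time, while transform's function receives a whole batch and can reshape it.

```scala
// Minimal model: a "stream" is a sequence of micro-batches (stand-ins for RDDs).
// All names here are hypothetical, not part of the Spark API.
object MapVsTransform {
  type MicroBatch[A]  = Seq[A]
  type MicroStream[A] = Seq[MicroBatch[A]]

  // Like DStream.map: the function is applied to each element of each batch.
  def mapStream[A, B](s: MicroStream[A])(f: A => B): MicroStream[B] =
    s.map(batch => batch.map(f))

  // Like DStream.transform: the function receives the whole batch and can
  // perform batch-level operations (union, sort, join with static data, ...).
  def transformStream[A, B](s: MicroStream[A])(f: MicroBatch[A] => MicroBatch[B]): MicroStream[B] =
    s.map(f)

  def main(args: Array[String]): Unit = {
    val stream: MicroStream[Int] = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

    // map can only rewrite elements one by one
    println(mapStream(stream)(_ + 1))        // List(List(2, 3, 4), List(5, 6, 7))

    // transform can do things map cannot, e.g. append a common element
    // (the analogue of rdd.union(commonRdd) above)
    println(transformStream(stream)(_ :+ 0)) // List(List(1, 2, 3, 0), List(4, 5, 6, 0))
  }
}
```

The key point the model makes explicit: map's signature fixes the batch structure (one output element per input element), while transform can change the number of elements per batch, which is exactly what the union example above relies on.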