 

What is the exact difference between transform and map on a Spark DStream?

I am trying to understand the transform operation on a Spark DStream in Spark Streaming.

I know that transform is more powerful than map, but can someone give me a real-world or otherwise clear example that differentiates transform from map?

Srini asked Aug 23 '15


People also ask

What happens if you call the reduceByKey transformation on a DStream?

reduceByKey(func, [numTasks]): when called on a DStream of (K, V) pairs, reduceByKey returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
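For example, a minimal word-count sketch using reduceByKey (this assumes an existing StreamingContext named ssc and a test socket source; the names are illustrative only):

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
// reduceByKey aggregates the values for each key within every batch
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()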

What is the difference between DStream and Structured Streaming?

Internally, a DStream is a sequence of RDDs. Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the SparkSQL API for data stream processing.
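For comparison, a minimal Structured Streaming sketch (assuming Spark 2.x or later; the socket source is for testing only, and the names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("StructuredExample").getOrCreate()
// the stream arrives as an unbounded DataFrame rather than a sequence of RDDs
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val query = lines.writeStream.format("console").start()
query.awaitTermination()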

What is the map transformation in Spark?

map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD. In the map operation, the developer can define custom business logic; the same logic is applied to every element of the RDD.
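A minimal sketch (assuming an existing SparkContext named sc):

val rdd = sc.parallelize(Seq(1, 2, 3))
// map applies the function to each element and returns a new RDD
val squared = rdd.map(x => x * x)   // contains 1, 4, 9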

Which transformation returns a new DStream by selecting only those records of the source DStream for which the function returns true?

filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
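A minimal sketch (assuming a DStream[Int] such as the numsDStream built in the answer below):

// keep only the even elements of each RDD in the stream
val evensDStream = numsDStream.filter(x => x % 2 == 0)
evensDStream.print()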


1 Answer

map is an element-wise transformation, while transform is an RDD-to-RDD transformation.

map


map(func): Return a new DStream by passing each element of the source DStream through a function func.

Here is an example that demonstrates both the map and transform operations on a DStream:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingTransformExample")
val ssc = new StreamingContext(conf, Seconds(5))

// build a queue-backed DStream: each queued RDD becomes one 5-second batch
val rdd1 = ssc.sparkContext.parallelize(Array(1, 2, 3))
val rdd2 = ssc.sparkContext.parallelize(Array(4, 5, 6))
val rddQueue = new Queue[RDD[Int]]
rddQueue.enqueue(rdd1)
rddQueue.enqueue(rdd2)

// oneAtATime = true: consume one queued RDD per batch interval
val numsDStream = ssc.queueStream(rddQueue, true)
val plusOneDStream = numsDStream.map(x => x + 1)
plusOneDStream.print()

The map operation adds 1 to each element in every RDD within the DStream, producing the output shown below:

-------------------------------------------
Time: 1501135220000 ms
-------------------------------------------
2
3
4

-------------------------------------------
Time: 1501135225000 ms
-------------------------------------------
5
6
7

-------------------------------------------

transform


transform(func): Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to perform arbitrary RDD operations on the DStream.

val commonRdd = ssc.sparkContext.parallelize(Array(0))
val combinedDStream = numsDStream.transform(rdd => rdd.union(commonRdd))
combinedDStream.print()

ssc.start()             // required to actually run both examples above
ssc.awaitTermination()

transform lets you perform RDD operations such as join, union, etc. on the RDDs within the DStream; the example code above produces the output below:

-------------------------------------------
Time: 1501135490000 ms
-------------------------------------------
1
2
3
0

-------------------------------------------
Time: 1501135495000 ms
-------------------------------------------
4
5
6
0

-------------------------------------------
Time: 1501135500000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1501135505000 ms
-------------------------------------------
0
-------------------------------------------

Here the commonRdd, which contains the single element 0, is unioned with every underlying RDD within the DStream. In the last two batches the queue has been drained, so the batch RDD is empty and only the element 0 from commonRdd appears. A join works the same way, as the sketch below shows.
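Since join is mentioned above, here is a hypothetical sketch of joining each batch against a static lookup RDD via transform (the lookup data and names are illustrative only):

val lookupRdd = ssc.sparkContext.parallelize(Seq(1 -> "one", 2 -> "two"))
val pairsDStream = numsDStream.map(x => (x, x * 10))
// transform exposes the underlying RDD, so any RDD-to-RDD operation works
val joinedDStream = pairsDStream.transform(rdd => rdd.join(lookupRdd))
joinedDStream.print()

This is exactly what map cannot do: map only sees individual elements, while transform sees the whole RDD for each batch.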

Remis Haroon - رامز answered Oct 21 '22