I am trying to understand the transform operation on a Spark DStream in Spark Streaming.
I know that transform is more powerful than map, but can someone give me a real-world or clear example that differentiates transform from map?
map is an element-wise transformation, whereas transform is an RDD-level transformation.
map(func) : Return a new DStream by passing each element of the source DStream through a function func.
Here is an example that demonstrates both the map and transform operations on a DStream:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingTransformExample")
val ssc = new StreamingContext(conf, Seconds(5))

// Build a queue of RDDs to simulate a stream of micro-batches
val rdd1 = ssc.sparkContext.parallelize(Array(1, 2, 3))
val rdd2 = ssc.sparkContext.parallelize(Array(4, 5, 6))
val rddQueue = new Queue[RDD[Int]]
rddQueue.enqueue(rdd1)
rddQueue.enqueue(rdd2)

// oneAtATime = true: consume one RDD from the queue per batch interval
val numsDStream = ssc.queueStream(rddQueue, true)

val plusOneDStream = numsDStream.map(x => x + 1)
plusOneDStream.print()
The map operation adds 1 to each element of every RDD in the DStream, giving the output shown below:
-------------------------------------------
Time: 1501135220000 ms
-------------------------------------------
2
3
4
-------------------------------------------
Time: 1501135225000 ms
-------------------------------------------
5
6
7
-------------------------------------------
transform(func) : Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to apply arbitrary RDD operations on the DStream.
val commonRdd = ssc.sparkContext.parallelize(Array(0))
val combinedDStream = numsDStream.transform(rdd => rdd.union(commonRdd))
combinedDStream.print()

ssc.start()             // start the streaming computation
ssc.awaitTermination()
transform lets you perform RDD operations such as join and union on the RDDs within the DStream; the example code above produces the output below:
-------------------------------------------
Time: 1501135490000 ms
-------------------------------------------
1
2
3
0
-------------------------------------------
Time: 1501135495000 ms
-------------------------------------------
4
5
6
0
-------------------------------------------
Time: 1501135500000 ms
-------------------------------------------
0
-------------------------------------------
Time: 1501135505000 ms
-------------------------------------------
0
-------------------------------------------
Here commonRdd, which contains the single element 0, is unioned with every RDD in the DStream. Note that once the queue is exhausted, each subsequent batch is an empty RDD, so the union yields only the element 0; that is why the later batches print just 0.
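The difference can also be illustrated without a Spark cluster. The sketch below is a hypothetical model (the names MapVsTransform, mapStream, and transformStream are invented for illustration) in which a DStream is approximated as a plain Seq of micro-batches: map's function sees one element at a time, while transform's function receives a whole batch and can reshape it.

```scala
// Minimal model: a "stream" is a sequence of micro-batches (stand-ins for RDDs).
// All names here are hypothetical, not part of the Spark API.
object MapVsTransform {
  type MicroBatch[A]  = Seq[A]
  type MicroStream[A] = Seq[MicroBatch[A]]

  // Like DStream.map: the function is applied to each element of each batch.
  def mapStream[A, B](s: MicroStream[A])(f: A => B): MicroStream[B] =
    s.map(batch => batch.map(f))

  // Like DStream.transform: the function receives the whole batch and can
  // perform batch-level operations (union, sort, join with static data, ...).
  def transformStream[A, B](s: MicroStream[A])(f: MicroBatch[A] => MicroBatch[B]): MicroStream[B] =
    s.map(f)

  def main(args: Array[String]): Unit = {
    val stream: MicroStream[Int] = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

    // map can only rewrite elements one by one
    println(mapStream(stream)(_ + 1))        // List(List(2, 3, 4), List(5, 6, 7))

    // transform can do things map cannot, e.g. append a common element
    // (the analogue of rdd.union(commonRdd) above)
    println(transformStream(stream)(_ :+ 0)) // List(List(1, 2, 3, 0), List(4, 5, 6, 0))
  }
}
```

The key point the model makes explicit: map's signature fixes the batch structure (one output element per input element), while transform can change the number of elements per batch, which is exactly what the union example above relies on.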