What's the meaning of the DStream.foreachRDD function?

Tags:

apache-spark

spark-streaming

In Spark Streaming, each batch interval always generates one and only one RDD, so why do we use foreachRDD() to iterate over RDDs? There is only one RDD, so a foreach seems unnecessary. In my testing, I have never seen more than one RDD.

403

asked Apr 05 '16 08:04

Guo

1 Answer

A DStream or "discretized stream" is an abstraction that breaks a continuous stream of data into small chunks. This is called "microbatching". Each microbatch becomes an RDD that is given to Spark for further processing. There's one and only one RDD produced for each DStream at each batch interval.

An RDD is a distributed collection of data. Think of it as a set of pointers to where the actual data is in a cluster.

DStream.foreachRDD is an "output operator" in Spark Streaming. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, using foreachRDD you could write data to a database.
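As a rough illustration of the database case, here is a minimal sketch. RecordSink, SaveToSink and openSink are hypothetical names I made up to stand in for a real database client; only foreachRDD and foreachPartition are Spark calls. It opens one sink per partition on the executors, assuming the factory function can be serialized and shipped with the closure:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sink standing in for a real database client (not a Spark API).
trait RecordSink extends AutoCloseable {
  def write(record: String): Unit
}

object SaveToSink {
  // `openSink` is a hypothetical factory. It is captured by the closure below,
  // so it has to be serializable in order to reach the executors.
  def save(lines: DStream[String], openSink: () => RecordSink): Unit =
    lines.foreachRDD { rdd =>
      // This block runs on the driver, once per batch interval.
      rdd.foreachPartition { records =>
        // This block runs on an executor, once per partition of the batch.
        val sink = openSink()
        try records.foreach(sink.write)
        finally sink.close()
      }
    }
}
```

Opening the sink inside foreachPartition rather than at the top of foreachRDD keeps the connection on the executor that uses it, and amortizes its cost over all records of a partition.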

The little mind twist here is to understand that a DStream is a time-bound collection. Let me contrast this with a classical collection: Take a list of users and apply a foreach to it:

    val userList: List[User] = ???
    userList.foreach { user => doSomeSideEffect(user) }

This will apply the side-effecting function doSomeSideEffect to each element of the userList collection.

Now, let's say that we don't know all the users up front, so we cannot build a list of them. Instead, we have a stream of users, like people arriving at a coffee shop during the morning rush:

    val userDStream: DStream[User] = ???
    userDStream.foreachRDD { usersRDD =>
      usersRDD.foreach { user => serveCoffee(user) }
    }

Note that:

- DStream.foreachRDD gives you an RDD[User], not a single user. Going back to our coffee example, that is the collection of users that arrived during some interval of time.
- To access single elements of the collection, we need to further operate on the RDD. In this case, I'm using rdd.foreach to serve coffee to each user.
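To make the "for each interval" reading concrete, here is a small local sketch. The object name OneRddPerBatch, the local[2] master, the 1-second interval and the 5-second timeout are all my own choices for the demo. It feeds three pre-built RDDs through queueStream, a test source, and uses the two-argument form of foreachRDD, which also passes the batch time:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object OneRddPerBatch {
  // Returns the record count of every batch that was processed, in order.
  def run(): Seq[Long] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("OneRddPerBatch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Test source: with oneAtATime = true, each queued RDD becomes one micro-batch.
    val queue = mutable.Queue[RDD[Int]]()
    (1 to 3).foreach(i => queue += ssc.sparkContext.parallelize(Seq(i, i * 10)))
    val stream = ssc.queueStream(queue, oneAtATime = true)

    val counts = mutable.ArrayBuffer[Long]()
    stream.foreachRDD { (rdd: RDD[Int], time: Time) =>
      // Called on the driver once per batch interval, with that interval's single RDD.
      val n = rdd.count()
      counts.synchronized { counts += n }
      println(s"batch at $time: $n records")
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few intervals elapse
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    counts.synchronized { counts.toList }
  }

  def main(args: Array[String]): Unit = {
    val counts = run()
    println(s"record counts per batch: $counts")
  }
}
```

The function is invoked again at every interval, each time with a different RDD, so the "each" in foreachRDD ranges over time rather than over several RDDs within one batch. With this test source, intervals after the queue is drained should still deliver an RDD, just an empty one.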

To think about execution: We might have a cluster of baristas making coffee. Those are our executors. Spark Streaming takes care of making a small batch of users (or orders) and Spark will distribute the work across the baristas, so that we can parallelize the coffee making and speed up the coffee serving.

answered Oct 14 '22 16:10

maasg
