In Spark Streaming, each batch interval of data always generates one and only one RDD, so why do we use foreachRDD() to iterate over RDDs? There is only one RDD, so there seems to be nothing to loop over. In my testing, I have never seen more than one RDD.
Discretized Streams (DStreams): a DStream represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream.
Some of the output operations are print() and the save operations such as saveAsTextFiles(). A save operation takes a file-name prefix (typically a directory path) and an optional suffix. print() takes the first 10 elements from each batch of the DStream and prints them.
foreachRDD is a very important output operation that is applied to each RDD in a DStream. It takes a function whose argument is an RDD of the corresponding DStream and which returns Unit (the empty type in Scala).
Internally, a DStream is characterized by a few basic properties: a list of other DStreams that the DStream depends on, a time interval at which the DStream generates an RDD, and a function that is used to generate an RDD after each time interval.
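The three properties above can be sketched with a toy model in plain Scala. This is only an illustration: ToyDStream, ToyInput and ToyMapped are hypothetical names, not Spark classes, and a Seq[T] stands in for an RDD[T].

```scala
// Toy model only: these are NOT Spark's classes. A Seq[T] stands in for an RDD[T].
trait ToyDStream[T] {
  def dependencies: List[ToyDStream[_]]      // the streams this one depends on
  def batchIntervalMs: Long                  // how often a batch is generated
  def compute(timeMs: Long): Option[Seq[T]]  // the batch for one interval, if any
}

// An input stream backed by pre-recorded batches, keyed by batch time.
final class ToyInput[T](val batchIntervalMs: Long, batches: Map[Long, Seq[T]])
    extends ToyDStream[T] {
  val dependencies: List[ToyDStream[_]] = Nil
  def compute(timeMs: Long): Option[Seq[T]] = batches.get(timeMs)
}

// A transformed stream: depends on one parent and maps each of its batches.
final class ToyMapped[A, B](parent: ToyDStream[A], f: A => B) extends ToyDStream[B] {
  val dependencies: List[ToyDStream[_]] = List(parent)
  val batchIntervalMs: Long = parent.batchIntervalMs
  def compute(timeMs: Long): Option[Seq[B]] = parent.compute(timeMs).map(_.map(f))
}
```

Asking the mapped stream for the batch at a given time walks back through its dependency, which is roughly how a chain of transformations ends up producing one batch per interval.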
A DStream, or "discretized stream", is an abstraction that breaks a continuous stream of data into small chunks. This is called "microbatching". Each microbatch becomes an RDD that is given to Spark for further processing. Exactly one RDD is produced for each DStream at each batch interval.
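A small experiment can make the "one RDD per batch interval" point concrete. The sketch below assumes Spark Streaming is on the classpath and runs in local mode. It feeds three pre-built RDDs through queueStream and prints what foreachRDD receives each time it is called. The object name, the sample data and the timeout are arbitrary choices for illustration.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object OneRddPerBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("OneRddPerBatch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

    // Three pre-built RDDs; with oneAtATime = true the stream consumes
    // one queued RDD per batch interval.
    val queue = mutable.Queue[RDD[Int]]()
    (1 to 3).foreach(i => queue += ssc.sparkContext.makeRDD(Seq(i, i * 10)))
    val stream = ssc.queueStream(queue, oneAtATime = true)

    // The function is invoked once per batch interval, each time with the
    // single RDD that belongs to that interval.
    stream.foreachRDD { (rdd: RDD[Int], time: Time) =>
      println(s"batch at $time: ${rdd.collect().mkString(", ")}")
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

The "each" in foreachRDD therefore runs over time: within a batch there is a single RDD, but the function is called again for every later batch.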
An RDD is a distributed collection of data. Think of it as a set of pointers to where the actual data is in a cluster.
DStream.foreachRDD is an "output operator" in Spark Streaming. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, using foreachRDD you could write data to a database.
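As a rough sketch of the database case, one common pattern is to open one connection per partition inside foreachRDD. SinkConnection below is a hypothetical stand-in for a real database client, and SaveToSink.save is an illustrative helper, not part of Spark.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for a real database client; replace with your own.
class SinkConnection {
  def send(record: String): Unit = println(s"saved: $record")
  def close(): Unit = ()
}

object SaveToSink {
  def save(lines: DStream[String]): Unit =
    lines.foreachRDD { rdd =>
      // This outer function runs on the driver, once per batch interval.
      rdd.foreachPartition { partition =>
        // This inner function runs on an executor, once per partition, so the
        // connection is created where it is used and is never serialized.
        val conn = new SinkConnection
        try partition.foreach(conn.send)
        finally conn.close()
      }
    }
}
```

Creating the connection inside foreachPartition, rather than on the driver or once per record, avoids shipping a non-serializable object to the executors and keeps the number of connections proportional to the number of partitions.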
The little mind twist here is to understand that a DStream is a time-bound collection. Let me contrast this with a classical collection: Take a list of users and apply a foreach to it:
val userList: List[User] = ???
userList.foreach { user => doSomeSideEffect(user) }
This will apply the side-effecting function doSomeSideEffect to each element of the userList collection.
Now, let's say that we don't know all the users now, so we cannot build a list of them. Instead, we have a stream of users, like people arriving into a coffee shop during morning rush:
val userDStream: DStream[User] = ???
userDStream.foreachRDD { usersRDD =>
  usersRDD.foreach { user => serveCoffee(user) }
}
Note that:
DStream.foreachRDD gives you an RDD[User], not a single user. Going back to our coffee example, that is the collection of users that arrived during some interval of time.
To reach the individual elements of that collection, we need to operate further on the RDD. Here, rdd.foreach is used to serve coffee to each user.
To think about execution: we might have a cluster of baristas making coffee. Those are our executors. Spark Streaming takes care of making a small batch of users (or orders), and Spark distributes the work across the baristas, so that we can parallelize the coffee making and speed up the coffee serving.