
What's the meaning of DStream.foreachRDD function?

In Spark Streaming, each batch interval always generates one and only one RDD, so why do we use foreachRDD() to iterate over RDDs? There is only one RDD, so there is nothing to iterate over. In my testing, I have never seen more than one RDD.

Guo asked Apr 05 '16 08:04


People also ask

What does DStream mean?

A Discretized Stream (DStream) represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream.

What is DStream operation?

Output operations include print(), saveAsTextFiles(), saveAsObjectFiles(), and foreachRDD(). The save operations take a file-name prefix and an optional suffix. print() prints the first 10 elements of each batch of the DStream.

What is Spark foreachRDD?

foreachRDD is an output operation that is applied to each RDD in a DStream. It takes a function that receives an RDD of the corresponding DStream as its argument and returns Unit (the empty type in Scala).

What is DStream internally?

Internally, a DStream is characterized by a few basic properties: a list of other DStreams that it depends on, a time interval at which it generates an RDD, and a function that is used to generate an RDD after each time interval.


1 Answer

A DStream or "discretized stream" is an abstraction that breaks a continuous stream of data into small chunks. This is called "microbatching". Each microbatch becomes an RDD that is given to Spark for further processing. There's one and only one RDD produced for each DStream at each batch interval.
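As a rough mental model only, microbatching can be sketched with plain Scala collections and no Spark at all. The `Arrival` type, the `microBatches` helper, and the sample timestamps are all made up for illustration; each inner `Seq` plays the role of the single RDD produced for one batch interval.

```scala
// Hypothetical event type: a user name plus an arrival time in milliseconds.
final case class Arrival(user: String, atMillis: Long)

// Group arrivals into consecutive time windows of `batchMillis` each,
// ordered by window. Each inner Seq stands in for one microbatch (one RDD).
def microBatches(arrivals: Seq[Arrival], batchMillis: Long): Seq[Seq[Arrival]] =
  arrivals
    .groupBy(_.atMillis / batchMillis) // window index for each arrival
    .toSeq
    .sortBy(_._1)                      // order windows by time
    .map(_._2)

val arrivals = Seq(
  Arrival("ann", 100), Arrival("bob", 900),
  Arrival("cho", 1200),
  Arrival("dee", 2100), Arrival("eli", 2500)
)

// With a 1000 ms interval this gives three batches: (ann, bob), (cho), (dee, eli).
val batches = microBatches(arrivals, batchMillis = 1000)
```

One difference from the real thing: this sketch skips intervals with no arrivals, whereas Spark Streaming still produces an (empty) RDD for such an interval.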

An RDD is a distributed collection of data. Think of it as a set of pointers to where the actual data is in a cluster.

DStream.foreachRDD is an "output operator" in Spark Streaming. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, using foreachRDD you could write data to a database.
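A minimal sketch of the database case, assuming a JDBC-reachable database: the `saveLines` helper, the `jdbcUrl` parameter, and the `events(line)` table are hypothetical names used only for illustration. It follows the commonly recommended pattern of opening the connection inside `foreachPartition`, so that it is created on the executor rather than serialized from the driver.

```scala
import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: `jdbcUrl` and the `events(line)` table are assumptions.
def saveLines(lines: DStream[String], jdbcUrl: String): Unit =
  lines.foreachRDD { rdd =>
    // This outer function is invoked once per batch interval.
    rdd.foreachPartition { partition =>
      // This inner function runs on an executor, once per partition,
      // so each partition opens (and closes) its own connection.
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val stmt = conn.prepareStatement("INSERT INTO events(line) VALUES (?)")
        partition.foreach { line =>
          stmt.setString(1, line)
          stmt.executeUpdate()
        }
        stmt.close()
      } finally conn.close()
    }
  }
```

In a real job a connection pool would usually replace the per-partition `getConnection` call; the plain version is kept here to keep the sketch short.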

The little mind twist here is to understand that a DStream is a time-bound collection. Let me contrast this with a classical collection: Take a list of users and apply a foreach to it:

val userList: List[User] = ???
userList.foreach { user => doSomeSideEffect(user) }

This will apply the side-effecting function doSomeSideEffect to each element of the userList collection.

Now, let's say that we don't know all the users yet, so we cannot build a list of them. Instead, we have a stream of users, like people arriving at a coffee shop during the morning rush:

val userDStream: DStream[User] = ???
userDStream.foreachRDD { usersRDD =>
  usersRDD.foreach { user => serveCoffee(user) }
}

Note that:

  • the DStream.foreachRDD gives you an RDD[User], not a single user. Going back to our coffee example, that is the collection of users that arrived during some interval of time.
  • to access single elements of the collection, we need to further operate on the RDD. In this case, I'm using an rdd.foreach to serve coffee to each user.
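To see the "for each RDD over time" behaviour directly, here is a small local sketch that uses `queueStream` as a stand-in source. The user names, the one-second interval, and the five-second run time are arbitrary choices for the demo, and it assumes Spark Streaming is on the classpath.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRddDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreachRDD-demo")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

    // Three pre-built RDDs stand in for users arriving over time.
    val arrivals = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("ann", "bob")),
      ssc.sparkContext.parallelize(Seq("cho")),
      ssc.sparkContext.parallelize(Seq("dee", "eli", "fay"))
    )
    // oneAtATime = true: consume one queued RDD per batch interval.
    val users = ssc.queueStream(arrivals, oneAtATime = true)

    users.foreachRDD { (rdd, time) =>
      // Invoked on the driver once per batch interval, with that interval's RDD.
      println(s"batch at $time: ${rdd.count()} user(s)")
      // rdd.foreach runs on the executors (local threads in this demo).
      rdd.foreach(user => println(s"serving coffee to $user"))
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few intervals elapse
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

The function passed to foreachRDD is called again at every interval, each time with a different RDD. That is the sense in which it iterates: across batches in time, not across several RDDs within one batch. Once the queue is drained, the remaining intervals should report an empty batch.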

To think about execution: We might have a cluster of baristas making coffee. Those are our executors. Spark Streaming takes care of making a small batch of users (or orders) and Spark will distribute the work across the baristas, so that we can parallelize the coffee making and speed up the coffee serving.

maasg answered Oct 14 '22 16:10