 

Does a DStream's RDD pull the entire data created for the batch interval in one shot?

I have gone through this Stack Overflow question; as per the answer, it creates a DStream with only one RDD for the batch interval.

For example:

My batch interval is 1 minute, and my Spark Streaming job is consuming data from a Kafka topic.

My question is: does the RDD available in the DStream pull/contain the entire data for the last one minute? Are there any criteria or options we need to set to pull all the data created in the last one minute?

If I have a Kafka topic with 3 partitions, and all 3 partitions contain data for the last one minute, will the DStream pull/contain all the data created in the last one minute across all the Kafka topic partitions?

Update:

In which case does a DStream contain more than one RDD?

Shankar asked Dec 29 '25 08:12


2 Answers

A Spark Streaming DStream is consuming data from a Kafka topic that is partitioned, say into 3 partitions on 3 different Kafka brokers.

Does the RDD available in the DStream pull/contain the entire data for the last one minute?

Not quite. The RDD only describes the offsets to read data from when tasks are submitted for execution. It is just like other RDDs in Spark: they are merely a description of what to do and where to find the data to work on when their tasks are submitted.

If, however, you use "pulls/contains" more loosely to express that at some point the records (from the partitions at the given offsets) are going to be processed, then yes, you're right: the entire minute is mapped to offsets, and the offsets are in turn mapped to records that Kafka hands over for processing.
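To make the "an RDD is only a description" point concrete, here is a minimal plain-Python simulation. This is not the actual Spark or Kafka API; `OffsetRange`, `KafkaBatchRDD`, and the data are all illustrative. The batch RDD records per-partition offset ranges when it is created, and nothing is read from the log until its tasks actually run:

```python
from dataclasses import dataclass

# Simulated Kafka log: partition -> list of records (illustrative data only).
kafka_log = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"]}

@dataclass(frozen=True)
class OffsetRange:
    partition: int
    from_offset: int
    until_offset: int  # exclusive

class KafkaBatchRDD:
    """A batch's RDD only *describes* which offsets to read; nothing is
    fetched from the log until compute() runs (i.e. until tasks execute)."""
    def __init__(self, offset_ranges):
        self.offset_ranges = offset_ranges
        self.fetched = False

    def compute(self):
        self.fetched = True
        return [rec
                for r in self.offset_ranges
                for rec in kafka_log[r.partition][r.from_offset:r.until_offset]]

# At batch time the driver snapshots the latest offset of each partition and
# builds one offset range per Kafka partition since the last batch.
last_offsets = {0: 0, 1: 0, 2: 0}
latest_offsets = {p: len(recs) for p, recs in kafka_log.items()}
rdd = KafkaBatchRDD([OffsetRange(p, last_offsets[p], latest_offsets[p])
                     for p in kafka_log])

was_fetched_at_creation = rdd.fetched   # False: the RDD is just a description
records = rdd.compute()                 # "tasks run": records are read now
```

Only when `compute()` runs (standing in for task execution) does the whole minute's worth of records materialize.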

in all the Kafka topic partitions?

Yes. It's Kafka, not necessarily Spark Streaming / DStream / RDD, that handles this. A DStream's RDDs request records from topic(s) and their partitions by offsets, from the last time they queried up to now.

The minute as seen by Spark Streaming might be slightly different from Kafka's view, since a DStream's RDDs contain records for offset ranges, not records per time window.

In which case DStream contains more than one RDD?

Never.
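The "exactly one RDD per batch interval" behavior can be sketched as a driver-side loop, again in plain Python with illustrative names and data: at each tick the driver snapshots the latest per-partition offsets and emits one RDD description covering everything since the previous tick.

```python
def batches(offset_snapshots):
    """Yield one 'RDD description' (per-partition offset ranges) per batch tick."""
    last = {p: 0 for p in offset_snapshots[0]}
    for now in offset_snapshots:
        # Exactly one RDD per batch interval, covering everything since last tick.
        yield {p: (last[p], now[p]) for p in now}
        last = now

# Latest offsets observed at three consecutive batch ticks (illustrative).
ticks = [{0: 2, 1: 1}, {0: 5, 1: 4}, {0: 5, 1: 6}]
rdds = list(batches(ticks))
```

Note that even a tick with no new data in a partition (partition 0 at the third tick) still produces one RDD, just with an empty offset range for that partition.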

Jacek Laskowski answered Dec 30 '25 20:12



I recommend reading more about the DStream abstraction in the Spark documentation.

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data [...]. Internally, a DStream is represented by a continuous series of RDDs.

I would add one point to that: don't forget that an RDD is itself another layer of abstraction, so it can be divided into smaller chunks and spread across the cluster.

Considering your questions:

  • Yes, each time the batch interval fires, a job is submitted with one RDD, and this RDD contains all the data from the previous minute.
  • If your job consumes a Kafka stream with multiple partitions, all the partitions are consumed in parallel, so the data from all partitions ends up in the resulting RDD.
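The per-partition parallelism in the second bullet can be sketched like this, a plain-Python simulation using threads (the topic data and function names are illustrative, not the Spark API):

```python
from concurrent.futures import ThreadPoolExecutor

# A 3-partition topic holding one minute's worth of records (illustrative).
topic = {0: ["p0-r0", "p0-r1"], 1: ["p1-r0"], 2: ["p2-r0", "p2-r1", "p2-r2"]}

def read_partition(p):
    # In Spark, each Kafka partition maps to one RDD partition / one task.
    return topic[p]

# All partitions are consumed in parallel...
with ThreadPoolExecutor(max_workers=len(topic)) as pool:
    rdd_partitions = list(pool.map(read_partition, topic))

# ...and the single batch RDD contains the data from every partition.
batch_records = [rec for part in rdd_partitions for rec in part]
```

One RDD partition per Kafka partition is what lets the single batch RDD be processed in parallel across the cluster while still containing the whole minute of data.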
vanekjar answered Dec 30 '25 22:12



