 

Does a DStream's RDD pull the entire data created for the batch interval in one shot?

I have gone through this Stack Overflow question; as per the answer, it creates a DStream with only one RDD for the batch interval.

For example:

My batch interval is 1 minute, and my Spark Streaming job is consuming data from a Kafka topic.

My question is: does the RDD available in the DStream pull/contain the entire data for the last one minute? Are there any criteria or options we need to set to pull all the data created in the last one minute?

If I have a Kafka topic with 3 partitions, and all 3 partitions contain data for the last one minute, will the DStream pull/contain all the data created in the last one minute across all the Kafka topic partitions?

Update:

In which case does a DStream contain more than one RDD?

Shankar asked Dec 29 '25 08:12


2 Answers

A Spark Streaming DStream is consuming data from a Kafka topic that is partitioned, say into 3 partitions on 3 different Kafka brokers.

Does the RDD available in the DStream pull/contain the entire data for the last one minute?

Not quite. The RDD only describes the offsets to read data from when tasks are submitted for execution. It is just like other RDDs in Spark: they are merely a description of what to do and where to find the data to work on when their tasks are submitted.

If, however, you use "pulls/contains" more loosely to express that at some point the records (from the partitions at the given offsets) are going to be processed, then yes, you're right: the entire minute is mapped to offsets, and the offsets are in turn mapped to records that Kafka hands over for processing.
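To make the "an RDD is only a description" point concrete, here is a minimal plain-Python simulation. This is not the actual Spark or Kafka API; `OffsetRange`, `KafkaBatchRDD`, and the data are all illustrative. The batch RDD records per-partition offset ranges when it is created, and nothing is read from the log until its tasks actually run:

```python
from dataclasses import dataclass

# Simulated Kafka log: partition -> list of records (illustrative data only).
kafka_log = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"]}

@dataclass(frozen=True)
class OffsetRange:
    partition: int
    from_offset: int
    until_offset: int  # exclusive

class KafkaBatchRDD:
    """A batch's RDD only *describes* which offsets to read; nothing is
    fetched from the log until compute() runs (i.e. until tasks execute)."""
    def __init__(self, offset_ranges):
        self.offset_ranges = offset_ranges
        self.fetched = False

    def compute(self):
        self.fetched = True
        return [rec
                for r in self.offset_ranges
                for rec in kafka_log[r.partition][r.from_offset:r.until_offset]]

# At batch time the driver snapshots the latest offset of each partition and
# builds one offset range per Kafka partition since the last batch.
last_offsets = {0: 0, 1: 0, 2: 0}
latest_offsets = {p: len(recs) for p, recs in kafka_log.items()}
rdd = KafkaBatchRDD([OffsetRange(p, last_offsets[p], latest_offsets[p])
                     for p in kafka_log])

was_fetched_at_creation = rdd.fetched   # False: the RDD is just a description
records = rdd.compute()                 # "tasks run": records are read now
```

Only when `compute()` runs (standing in for task execution) does the whole minute's worth of records materialize.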

in all the Kafka topic partitions?

Yes. It's Kafka, not necessarily Spark Streaming / DStream / RDD, that handles this. A DStream's RDDs request records from topic(s) and their partitions by offsets, from the last time they queried up to now.

The minute as seen by Spark Streaming might be slightly different from Kafka's view, since a DStream's RDDs contain records for offset ranges, not records per time window.

In which case DStream contains more than one RDD?

Never.
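The "exactly one RDD per batch interval" behavior can be sketched as a driver-side loop, again in plain Python with illustrative names and data: at each tick the driver snapshots the latest per-partition offsets and emits one RDD description covering everything since the previous tick.

```python
def batches(offset_snapshots):
    """Yield one 'RDD description' (per-partition offset ranges) per batch tick."""
    last = {p: 0 for p in offset_snapshots[0]}
    for now in offset_snapshots:
        # Exactly one RDD per batch interval, covering everything since last tick.
        yield {p: (last[p], now[p]) for p in now}
        last = now

# Latest offsets observed at three consecutive batch ticks (illustrative).
ticks = [{0: 2, 1: 1}, {0: 5, 1: 4}, {0: 5, 1: 6}]
rdds = list(batches(ticks))
```

Note that even a tick with no new data in a partition (partition 0 at the third tick) still produces one RDD, just with an empty offset range for that partition.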

Jacek Laskowski answered Dec 30 '25 20:12



I recommend reading more about the DStream abstraction in the Spark documentation.

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data [...]. Internally, a DStream is represented by a continuous series of RDDs.

I would add one point to that: don't forget that an RDD is itself another layer of abstraction, so it can be divided into smaller chunks and spread across the cluster.

Considering your questions:

  • Yes, each time the batch interval fires, a job is submitted with one RDD, and this RDD contains all the data from the previous minute.
  • If your job consumes a Kafka stream with multiple partitions, all the partitions are consumed in parallel, so the data from all partitions ends up in the resulting RDD.
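The per-partition parallelism in the second bullet can be sketched like this, a plain-Python simulation using threads (the topic data and function names are illustrative, not the Spark API):

```python
from concurrent.futures import ThreadPoolExecutor

# A 3-partition topic holding one minute's worth of records (illustrative).
topic = {0: ["p0-r0", "p0-r1"], 1: ["p1-r0"], 2: ["p2-r0", "p2-r1", "p2-r2"]}

def read_partition(p):
    # In Spark, each Kafka partition maps to one RDD partition / one task.
    return topic[p]

# All partitions are consumed in parallel...
with ThreadPoolExecutor(max_workers=len(topic)) as pool:
    rdd_partitions = list(pool.map(read_partition, topic))

# ...and the single batch RDD contains the data from every partition.
batch_records = [rec for part in rdd_partitions for rec in part]
```

One RDD partition per Kafka partition is what lets the single batch RDD be processed in parallel across the cluster while still containing the whole minute of data.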
vanekjar answered Dec 30 '25 22:12



