 

Spark Streaming - Batch Interval vs Processing time

We have a Spark Streaming application running on a YARN cluster.

It receives messages from Kafka topics.

Our processing time is longer than the batch interval.

Batch Interval : 1 Minute
Processing Time : 5 Minutes

I would like to know what happens to data that arrives while a batch is still being processed. Will that data stay available in memory until the processing is over, or will it be overwritten by the subsequent fetch?

We are using the direct streaming approach to fetch data from the Kafka topics.

Should I go with window-based operations? For example, if I have a window length of 5 minutes, a sliding interval of 2 minutes, and a batch interval of 1 minute, will it work? Because we cannot lose any data in our application.
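For reference, here is a minimal sketch of what such a setup could look like with the Kafka 0.10 direct API in Scala: a 1-minute batch interval and a windowed DStream with a 5-minute window length and a 2-minute sliding interval. The broker address, topic name, consumer group, and checkpoint path are placeholders, not values from our application.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object WindowedKafkaStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedKafkaStream")
    // Batch interval: 1 minute
    val ssc = new StreamingContext(conf, Minutes(1))
    // Checkpointing is recommended for windowed/stateful operations (placeholder path)
    ssc.checkpoint("/tmp/streaming-checkpoint")

    // Placeholder Kafka connection details
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))

    // Window length 5 minutes, sliding interval 2 minutes;
    // both must be multiples of the 1-minute batch interval.
    val windowed = stream.map(_.value).window(Minutes(5), Minutes(2))
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that both the window length and the sliding interval must be multiples of the batch interval, so 5 minutes and 2 minutes over a 1-minute batch interval is a valid combination.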

asked Feb 07 '17 by Shankar

People also ask

What is a batch interval in Spark Streaming?

A batch interval tells Spark for what duration it should fetch data; if it is 1 minute, each batch contains the data received over the last minute (source: spark.apache.org). So the data keeps pouring in as a stream of batches, and this continuous stream of data is called a DStream.

Is Spark batch processing or stream processing?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

Which best describes the batch size in a Spark Streaming job?

The minimum batch size Spark Streaming can use is 500 milliseconds, which has proven to be a good minimum for many applications. The best approach is to start with a larger batch size (around 10 seconds) and work your way down to a smaller batch size.

Is Spark structured Streaming real-time?

Basically, you can define DataFrames and work with them as you normally would when writing a batch job, but the processing of the data differs. One important thing to note here is that Structured Streaming does not process the data in real time but rather in near-real time.
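As an illustration of that DataFrame-based model, here is a minimal Structured Streaming sketch that reads from Kafka; the broker address and topic name are placeholders.

import org.apache.spark.sql.SparkSession

object StructuredKafkaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredKafkaExample").getOrCreate()

    // Define the stream as a DataFrame, just as you would for a batch source
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "example-topic")
      .load()

    // Kafka values arrive as binary; cast them to strings for processing
    val messages = df.selectExpr("CAST(value AS STRING) AS message")

    // Results are emitted in micro-batches (near-real-time), not per record
    val query = messages.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}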


1 Answer

In the direct streaming approach, data isn't read by a receiver and then dispatched to other workers. What happens is that the driver reads the offsets from Kafka and then dispatches each partition with a subset of the offsets to be read.

If your workers haven't finished processing the previous job, they won't start processing the next one (unless you explicitly set spark.streaming.concurrentJobs to more than 1). This means that the offsets will be read, but won't actually be dispatched to the executors responsible for reading the data, so there won't be any data loss whatsoever.

What this does mean is that your job will fall further and further behind and accumulate massive processing delays, which isn't something you want. As a rule of thumb, any Spark job's processing time should be less than the interval set for that job.
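For illustration, the configuration mentioned above, together with the backpressure settings that are commonly paired with the direct Kafka approach, could be set like this; the values shown are only examples, not tuned recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SparkStreamingApp")
  // Allow more than one batch to be processed concurrently
  // (use with care: batches may then complete out of order).
  .set("spark.streaming.concurrentJobs", "2")
  // With the direct Kafka approach, backpressure and a per-partition rate cap
  // limit how many records each batch reads when processing falls behind.
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")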

answered Sep 22 '22 by Yuval Itzchakov