We have a Spark Streaming application running on a YARN cluster. It receives messages from Kafka topics.
Our processing time is currently longer than the batch interval:
Batch interval: 1 minute
Processing time: 5 minutes
I would like to know what happens if data arrives while a batch is still being processed: will that data stay available in memory until the processing is over, or will it be overwritten by the subsequent fetch?
We are using the direct streaming approach to fetch data from Kafka topics.
Should I go with window-based operations? For example, with a window length of 5 minutes, a sliding interval of 2 minutes, and a batch interval of 1 minute, will it work (see the sketch below)? We cannot lose any data in our application.
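Roughly, the windowed setup I am considering looks like the sketch below (the socket source and word-count logic are only placeholders; in the real job the input comes from Kafka):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// 1-minute batch interval, as described above
val conf = new SparkConf().setAppName("window-sketch")
val ssc  = new StreamingContext(conf, Minutes(1))

// Placeholder input; the real job reads from Kafka via the direct approach
val lines = ssc.socketTextStream("localhost", 9999)

// 5-minute window, recomputed every 2 minutes, built out of the 1-minute batches
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKeyAndWindow(_ + _, Minutes(5), Minutes(2))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()
```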
The batch interval tells Spark how long to collect data before forming a batch: if it is 1 minute, each batch contains the data received during the last minute (source: spark.apache.org). So the data pours in as a continuous stream that is divided into batches; this continuous stream of data is called a DStream.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
The minimum batch size Spark Streaming can use is 500 milliseconds; it has proven to be a good minimum size for many applications. The best approach is to start with a larger batch size (around 10 seconds) and work your way down to a smaller one.
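For instance, a minimal sketch of how the batch interval is set (the 60-second value simply mirrors the 1-minute interval from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The Duration passed here is the batch interval: with Seconds(60), Spark
// Streaming cuts the incoming stream into 1-minute batches, and each batch
// is handed to the Spark engine as an RDD of that minute's data.
val conf = new SparkConf().setAppName("batch-interval-sketch")
val ssc  = new StreamingContext(conf, Seconds(60))
```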
With Structured Streaming, you can define DataFrames and work with them just as you would in a batch job, but the data is processed differently. One important thing to note is that Structured Streaming does not process data in real time but in near real time.
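As a rough sketch of what that looks like (the topic name "events", the broker address, and the checkpoint path are assumptions, not values from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-sketch").getOrCreate()

// The Kafka topic arrives as an unbounded DataFrame; each micro-batch of new
// rows is processed as it shows up, hence "near-real-time".
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

val query = kafkaDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/structured-sketch")
  .start()

query.awaitTermination()
```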
In the direct streaming approach, data isn't read by a receiver and then dispatched to other workers. Instead, the driver reads the offsets from Kafka and dispatches each partition, along with the subset of offsets to be read, to the executors.
If your workers haven't finished processing the previous job, they won't start processing the next one (unless you explicitly set spark.streaming.concurrentJobs to more than 1). This means the offsets will be read but not actually dispatched to the executors responsible for reading the data, so there won't be any data loss whatsoever.
What this does mean is that your job will fall further and further behind and accumulate massive processing delays, which isn't something you want. As a rule of thumb, any Spark job's processing time should be less than the interval set for that job.
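For reference, a hedged sketch of the direct approach described above (the topic name, group id, and broker address are placeholders): the driver computes an offset range per Kafka partition for each batch, and the executors read those ranges themselves, so unread offsets simply wait in Kafka while a slow batch finishes.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf()
  .setAppName("direct-stream-sketch")
  // Default is 1: batches queue up instead of running concurrently
  .set("spark.streaming.concurrentJobs", "1")

val ssc = new StreamingContext(conf, Minutes(1))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-direct-sketch",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// No receiver: the driver plans offset ranges, the executors pull them directly
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

stream.map(record => (record.key, record.value)).print()

ssc.start()
ssc.awaitTermination()
```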