I would like to understand how the receiver works in Spark Streaming. As I understand it, there are receiver tasks running on executors that collect data and save it as RDDs. Receivers start reading when start() is called. I need clarification on the following.
I would like to know the anatomy of Spark Streaming and of a receiver.
What are receivers? Receivers are special objects in Spark Streaming whose goal is to consume data from data sources and move it into Spark. Receivers are created by the streaming context as long-running tasks on different executors.
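For illustration only, here is a minimal sketch of what such an object looks like when you write a custom receiver against Spark's Receiver API (the class name and records are made up): onStart() launches a thread that reads from the source and hands records to Spark via store(), and the stream would be obtained with ssc.receiverStream(new DummyReceiver).
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    // Start a thread so onStart() returns immediately, as the API expects
    new Thread("dummy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record-" + System.currentTimeMillis()) // hand a record to Spark
          Thread.sleep(100)
        }
      }
    }.start()
  }
  def onStop(): Unit = { /* nothing to release in this sketch */ }
}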
Generally, Spark Streaming is used for real-time processing, but it is the older (or rather the original) RDD-based API. Spark Structured Streaming is the newer, highly optimized API, and users are advised to use it instead.
Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads.
I'm going to answer based on my experience with Kafka receivers, which seems more or less similar to what goes on in Kinesis.
How many receivers does a Spark Streaming job start? Multiple or one?
Each receiver you open is a single connection. In Kafka, if you want to read concurrently from multiple partitions, you need to open multiple receivers and usually union them together (see the code example at the end of this answer).
Is the receiver implemented as push-based or pull-based?
Pull. In Spark Streaming, each batch interval (specified when creating the StreamingContext) pulls data from Kafka.
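For reference, a minimal hedged sketch of fixing that batch interval when the StreamingContext is created (the app name and the 5-second interval are just examples):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("batch-interval-sketch")
// Every 5 seconds a new micro-batch pulls whatever data the receivers have collected
val ssc = new StreamingContext(sparkConf, Seconds(5))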
Can the receiver become a bottleneck in any case?
Broad question. It depends. If your batch intervals are long and you have a single receiver, your backlog may start to fill. It's mainly trial and error until you reach the optimum balance in your streaming job.
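Besides adding receivers, you can also cap or adapt the ingestion rate. As a hedged sketch, the configuration keys below are standard Spark Streaming settings, but the values are purely illustrative:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("rate-limit-sketch")
  .set("spark.streaming.receiver.maxRate", "1000")     // cap each receiver at ~1000 records/sec
  .set("spark.streaming.backpressure.enabled", "true") // let Spark adapt the rate to processing speed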
To achieve a degree of parallelism, the data should be partitioned across the worker nodes. So how is the streaming data distributed across the nodes?
You can create concurrency, as I previously stated, by opening multiple receivers to the underlying data source. Furthermore, after reading the data, it can be repartitioned using the standard Spark mechanisms for partitioning data, as sketched below.
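A hedged sketch of that (the ZooKeeper host, group, topic name and partition counts are made up): a single receiver-based Kafka stream is repartitioned right after ingestion so the heavier transformations run on more partitions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("repartition-sketch")
val ssc = new StreamingContext(sparkConf, Seconds(10))
// One receiver reading one topic; connection details are illustrative only
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", Map("events" -> 1))
// Spread the received records over 16 partitions before the heavier transformations
val repartitioned = stream.repartition(16)
repartitioned.count().print()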
If new RDDs are formed on a new node based on the batch time interval, how does the SparkContext serialize the transform functions to that node after the job is submitted?
The same way it serializes each Task in the stages, by using the serializer of choice and sending data over the wire. Not sure I understand what you mean here.
Can the number of receivers launched be controlled by a parameter?
Yes, you can have a configuration parameter which determines the number of receivers you open. Such code can look like this:
// This may be your config parameter
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
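To complete the picture, a hedged continuation of the snippet above: once the processing on unifiedStream is defined, the context still has to be started and kept alive.
streamingContext.start()            // launch the receivers and the batch scheduler
streamingContext.awaitTermination() // keep the application running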