 

Spark: Is the receiver in Spark Streaming a bottleneck?

I would like to understand how the receiver works in Spark Streaming. As I understand it, receiver tasks run in the executors, collect data, and save it as RDDs, and a receiver starts reading when start() is called (a sketch of my understanding of the Receiver API is at the end of this question). I need clarification on the following:

  1. How many receivers does a Spark Streaming job start? Multiple or one?
  2. Is the receiver implemented as push based or pull based?
  3. Can the receiver become a bottleneck in any case?
  4. To achieve a degree of parallelism, the data should be partitioned across the worker nodes. So how is the streaming data distributed across the nodes?
  5. If new RDDs are formed on a new node based on the batch time interval, how does SparkContext serialize the transform functions to that node after the job is submitted?
  6. Can the number of receivers launched be controlled by a parameter?

I would like to know the anatomy of Spark Streaming and its receivers.
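
For context, here is a rough sketch of my current understanding of the custom Receiver API; the class name and the reading loop are placeholders, not code I am actually running:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver: runs as a long-lived task on an executor; each call
// to store() hands a record to Spark, which groups records into blocks/RDDs
// once per batch interval.
class SketchReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    new Thread("sketch-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record read from the source") // placeholder for real reads
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // the reading thread exits once isStopped() returns true
  }
}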

asked Mar 14 '16 by nagendra


People also ask

What is receiver in Spark Streaming?

What are receivers? Receivers are special objects in Spark Streaming whose goal is to consume data from data sources and move it into Spark. Receivers are created by the streaming context as long-running tasks on different executors.

What is the difference between Spark and Spark Streaming?

Spark Streaming is generally used for real-time processing, but it is the older, RDD-based API. Spark Structured Streaming is the newer, highly optimized API, and users are advised to use it instead.

Is Apache Spark good for stream technology?

Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads.


1 Answer

I'm going to answer based on my experience with Kafka receivers, which seems more or less similar to what goes on in Kinesis.

How many receivers does the Spark Streaming job start? Multiple or one?

Each receiver you open is a single connection. In Kafka, if you want to read concurrently from multiple partitions, you need to open multiple receivers and usually union them together (see the code at the end of this answer).

Is the receiver implemented as push based or pull based?

Pull. In Spark Streaming, data is pulled from Kafka once every batch interval, which is specified when creating the StreamingContext.
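
For illustration, a minimal sketch of where that interval is set; the app name and the 5 second interval are just example values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch interval (5 seconds here) controls how often received data
// is turned into an RDD and processed as a micro-batch
val conf = new SparkConf().setAppName("receiver-sketch")
val streamingContext = new StreamingContext(conf, Seconds(5))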

Can the receiver become a bottleneck in any case?

Broad question. It depends. If your batch intervals are long and you have a single receiver, your backlog may start to fill up. It's mainly trial and error until you reach the optimum balance in your streaming job.
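
As a concrete knob for the bottleneck case, Spark exposes rate limiting and backpressure settings; a minimal sketch, assuming Spark 1.5+ for backpressure (the max rate value is just an example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Let Spark adapt the ingestion rate to what the job can actually process
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard cap on records per second per receiver (example value)
  .set("spark.streaming.receiver.maxRate", "10000")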

To achieve a degree of parallelism, the data should be partitioned across the worker nodes. So how is the streaming data distributed across the nodes?

You can create concurrency, as I stated previously, by opening multiple receivers to the underlying data source. Further, after the data has been read, it can be repartitioned using the standard Spark mechanisms for partitioning data.
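
For example, a sketch of spreading the received data over more partitions after ingestion; kafkaStream here stands for any receiver-based DStream, and the partition count is an arbitrary example:

// Redistribute received data across 16 partitions so downstream
// transformations get more parallelism than the receivers alone provide
val repartitioned = kafkaStream.repartition(16)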

If new RDDs are formed on a new node based on the batch time interval, how does SparkContext serialize the transform functions to that node after the job is submitted?

The same way it serializes each Task in the stages: by using the serializer of choice and sending the data over the wire. I'm not sure I fully understand what you mean here.
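
If by serializer of choice you mean the configurable data serializer, a minimal sketch of switching it to Kryo (note this affects data serialization; task closures are still shipped by Spark's internal closure serializer):

import org.apache.spark.SparkConf

// Kryo typically serializes shuffled and cached data more compactly
// than the default Java serialization
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")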

Can the number of receivers launched be controlled by a parameter?

Yes, you can have a configuration parameter which determines the number of receivers you open. Such code can look like this:

// This may be your config parameter
val numStreams = 5
// Open one receiver (one connection) per stream; the arguments to
// createStream are elided here and depend on your Kafka setup
val kafkaStreams = (1 to numStreams).map { _ => KafkaUtils.createStream(...) }

// Merge all receiver streams into a single DStream for downstream processing
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
answered Sep 18 '22 by Yuval Itzchakov