
Spark-Streaming Kafka Direct Streaming API & Parallelism

Tags:

apache-kafka

apache-spark

spark-streaming

I understand the automated mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.

In Spark Streaming, how exactly do data consumption, data processing, and task allocation work together? In other words:

Does the Spark task that corresponds to a Kafka partition both read and process the data?

The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning that a number of your executors' task slots were reserved for data ingestion and the others were left for processing. This had an impact on how you sized your executors in terms of cores.

Take, for example, the advice on how to launch Spark Streaming with --master local. Everyone will tell you that for Spark Streaming you should use local[2] at minimum, because one core will be dedicated to running the long-lived receiving task that never ends, and the other core will do the data processing.

So if the answer is that in this case the task does both the reading and the processing, then the follow-up question is: is that really smart? This sounds like something that should be asynchronous. We want to be able to fetch while we process, so that the data is already there for the next round of processing. However, if there is only one core (or, more precisely, one task) to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?

My original understanding was that things would have remained more or less the same, in the sense that one task would be launched to read and the processing would be done in another task. That would mean that if the processing task is not done yet, we can still keep reading, up to a certain memory limit.

Can someone outline clearly what exactly is going on here?

EDIT1

We don't even need this memory limit control. The mere ability to fetch while the processing is going on, and to stop right there, would be enough. In other words, the two processes should be asynchronous, and the limit is simply to stay one step ahead. If this is somehow not happening, I find it extremely strange that Spark would implement something that hurts performance like this.


asked Aug 05 '17 21:08

MaatDeamon

1 Answer

Does the Spark task that corresponds to a Kafka partition both read and process the data?

The relationship is very close to what you describe, if by a task we mean the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:

1. The driver reads the offsets from all Kafka topics and partitions.
2. The driver assigns each executor a topic and partition to be read and processed.
3. Unless there is a shuffle boundary operation, Spark will most likely run the entire execution for that partition on the same executor.

This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
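The flow above can be illustrated with a small toy model in plain Python. This is only a sketch of the idea, not Spark's or Kafka's API: `OffsetRange`, `plan_batch`, `read_range`, and `run_task` are hypothetical names, and each Kafka partition is assumed to be a simple in-memory list. The point it tries to show is that the driver only plans offset ranges, while each task lazily pulls its records and processes them in the same loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical stand-in for the slice of one Kafka partition in one batch."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(last: Dict[int, int], latest: Dict[int, int], topic: str) -> List[OffsetRange]:
    """Driver side: build one offset range (and hence one task) per Kafka partition."""
    return [OffsetRange(topic, p, last[p], latest[p]) for p in sorted(latest)]


def read_range(log: List[str], r: OffsetRange) -> Iterator[str]:
    """Executor side: lazily yield the records of one range, one at a time."""
    for offset in range(r.from_offset, r.until_offset):
        yield log[offset]


def run_task(log: List[str], r: OffsetRange, process: Callable[[str], str]) -> List[str]:
    """One task per partition: records are read and processed in the same loop."""
    return [process(record) for record in read_range(log, r)]


# Toy data: two partitions of a made-up topic called "events".
logs = {0: ["a", "b", "c", "d"], 1: ["x", "y", "z"]}
last = {0: 1, 1: 0}    # offsets already consumed by earlier batches
latest = {0: 4, 1: 2}  # offsets the driver sees when planning this batch

ranges = plan_batch(last, latest, "events")
results = {r.partition: run_task(logs[r.partition], r, str.upper) for r in ranges}
print(results)  # {0: ['B', 'C', 'D'], 1: ['X', 'Y']}
```

In this model there is no separate long-running receiving task: each task covers a bounded offset range, finishes, and frees its slot for the next batch.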

Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor, meaning that if a given worker was assigned a TopicPartition, it is likely to continue processing it for the entire lifetime of the application.
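Coming back to the sizing part of the question, here is a rough back-of-the-envelope sketch in Python. It assumes one task per Kafka partition per batch (no repartitioning), one core per task (the default `spark.task.cpus=1`), and that each receiver in the receiver-based API permanently occupies one task slot. The helper names `processing_slots` and `task_waves` are made up for illustration and are not part of any Spark API.

```python
import math


def processing_slots(total_cores: int, receivers: int = 0, cpus_per_task: int = 1) -> int:
    """Task slots left for processing; each receiver is assumed to pin one slot."""
    return max(total_cores // cpus_per_task - receivers, 0)


def task_waves(kafka_partitions: int, slots: int) -> int:
    """Sequential rounds needed to run one task per Kafka partition."""
    if slots <= 0:
        raise ValueError("no task slots available for processing")
    return math.ceil(kafka_partitions / slots)


# Receiver-based API on local[2]: one core receives, one core processes.
print(processing_slots(2, receivers=1))  # 1

# Direct API on local[2]: no receiver, so both cores can process.
print(processing_slots(2))  # 2

# Direct API, 12 Kafka partitions on 4 cores in total: 3 rounds of tasks per batch.
print(task_waves(12, processing_slots(4)))  # 3
```

Under these assumptions, with the direct API you can size total cores against the number of Kafka partitions: having as many cores as partitions lets every partition be read and processed in a single round per batch, while fewer cores means the tasks run in several rounds.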

answered Oct 13 '22 00:10

Yuval Itzchakov
