Spark + Kafka integration - mapping of Kafka partitions to RDD partitions

I have a couple of basic questions about Spark Streaming.

(Please let me know if these have been answered in other posts; I couldn't find any.)

(i) In Spark Streaming, is the number of partitions in an RDD by default equal to the number of workers?

(ii) In the Direct Approach for Spark-Kafka integration, the number of RDD partitions created is equal to the number of Kafka partitions. Is it right to assume that each RDD partition i would be mapped to the same worker node j in every batch of the DStream? I.e., is the mapping of a partition to a worker node based solely on the partition's index? For example, could partition 2 be assigned to worker 1 in one batch and worker 3 in another?
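For context, here is roughly the direct-stream setup I have in mind, with a print of each batch's partitions and offset ranges (the broker address, topic name, and app name below are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("PartitionInspect"), Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
val topics = Set("events")                                        // placeholder topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // In the direct approach there is one RDD partition per Kafka topic partition
  println(s"partitions in this batch: ${rdd.partitions.length}")
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { o =>
    println(s"${o.topic}-${o.partition}: offsets ${o.fromOffset} to ${o.untilOffset}")
  }
}

ssc.start()
ssc.awaitTermination()
```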

Thanks in advance

asked Sep 30 '15 by jithinpt

People also ask

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. Structured Streaming, in contrast, is built on the Spark SQL API for data stream processing: its queries are optimized by the Catalyst optimizer and translated into RDDs for execution under the hood.
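As an illustration (not part of the original answer), a minimal Structured Streaming read from Kafka looks like this; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredKafkaExample").getOrCreate()

// Kafka appears as a streaming DataFrame source instead of a DStream
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()

// key/value arrive as binary columns; cast to strings before use
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

query.awaitTermination()
```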

What is Kafka offset?

The offset is a simple integer that Kafka uses to maintain the current position of a consumer. The current offset is a pointer to the last record that Kafka has already sent to the consumer in the most recent poll, which is why the consumer doesn't receive the same record twice.
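A small sketch with the plain Kafka consumer API shows where offsets surface; the broker, group id, and topic below are placeholder values:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("group.id", "offset-demo")             // placeholder consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events")) // placeholder topic

// Each record carries the offset at which it sits in its partition
consumer.poll(Duration.ofSeconds(1)).forEach { record =>
  println(s"partition ${record.partition()} offset ${record.offset()}: ${record.value()}")
}
consumer.commitSync() // store the consumed position for this group
consumer.close()
```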

What is Spark streaming Kafka maxRatePerPartition?

An important one is spark.streaming.kafka.maxRatePerPartition, which is the maximum rate (in messages per second) at which each Kafka partition will be read by the direct API.
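For instance, a sketch of setting this limit when building the streaming context (the app name, rate, and batch interval here are arbitrary examples):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap each Kafka partition at 1000 messages/second for the direct stream.
// With 5-second batches and 4 partitions that is at most 20,000 records per batch.
val conf = new SparkConf()
  .setAppName("RateLimitedStream") // placeholder app name
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(5))
```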


1 Answer

(i) Default parallelism is the number of cores (or 8 for Mesos), but the number of partitions is up to the input stream implementation.
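A quick way to see the difference, as a sketch rather than anything from the original answer (the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ParallelismCheck"))

// Reflects spark.default.parallelism (total executor cores unless overridden)
println(sc.defaultParallelism)

// parallelize uses that default, but a Kafka direct stream ignores it:
// its partition count comes from the topic's partition count instead
println(sc.parallelize(1 to 1000).partitions.length)
```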

(ii) No, the mapping of partition indexes to worker nodes is not deterministic. If you're running Kafka on the same nodes as your Spark executors, the preferred location for a task is the node of the Kafka leader for that partition. But even then, a task may be scheduled on another node.
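One way to observe this, sketched here under the assumption that `stream` is the direct stream from the question's setup:

```scala
import java.net.InetAddress
import org.apache.spark.TaskContext

// Log which host processes each partition; across batches the same partition
// index can show up on different hosts, since placement is only a preference
stream.foreachRDD { rdd =>
  rdd.foreachPartition { _ =>
    println(s"partition ${TaskContext.get.partitionId} ran on " +
      InetAddress.getLocalHost.getHostName)
  }
}
```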

answered Oct 13 '22 by Cody Koeninger