Spark + Kafka integration - mapping of Kafka partitions to RDD partitions

I have a couple of basic questions about Spark Streaming.

(Please let me know if these have been answered in other posts; I couldn't find any.)

(i) In Spark Streaming, is the number of partitions in an RDD by default equal to the number of workers?

(ii) In the Direct Approach for Spark-Kafka integration, the number of RDD partitions created is equal to the number of Kafka partitions. Is it right to assume that each RDD partition i would be mapped to the same worker node j in every batch of the DStream? I.e., is the mapping of a partition to a worker node based solely on the partition's index? For example, could partition 2 be assigned to worker 1 in one batch and worker 3 in another?
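For context, here is roughly the direct-stream setup I have in mind, with a print of each batch's partitions and offset ranges (the broker address, topic name, and app name below are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("PartitionInspect"), Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
val topics = Set("events")                                        // placeholder topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // In the direct approach there is one RDD partition per Kafka topic partition
  println(s"partitions in this batch: ${rdd.partitions.length}")
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { o =>
    println(s"${o.topic}-${o.partition}: offsets ${o.fromOffset} to ${o.untilOffset}")
  }
}

ssc.start()
ssc.awaitTermination()
```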

Thanks in advance

asked Sep 30 '15 by jithinpt

People also ask

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. Structured Streaming, in contrast, is built on the Spark SQL API for data stream processing: its queries are optimized by the Catalyst optimizer and translated into RDDs for execution under the hood.
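As an illustration (not part of the original answer), a minimal Structured Streaming read from Kafka looks like this; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredKafkaExample").getOrCreate()

// Kafka appears as a streaming DataFrame source instead of a DStream
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()

// key/value arrive as binary columns; cast to strings before use
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

query.awaitTermination()
```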

What is Kafka offset?

The offset is a simple integer that Kafka uses to maintain the current position of a consumer. The current offset is a pointer to the last record that Kafka has already sent to the consumer in the most recent poll, which is why the consumer doesn't receive the same record twice.
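A small sketch with the plain Kafka consumer API shows where offsets surface; the broker, group id, and topic below are placeholder values:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("group.id", "offset-demo")             // placeholder consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events")) // placeholder topic

// Each record carries the offset at which it sits in its partition
consumer.poll(Duration.ofSeconds(1)).forEach { record =>
  println(s"partition ${record.partition()} offset ${record.offset()}: ${record.value()}")
}
consumer.commitSync() // store the consumed position for this group
consumer.close()
```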

What is Spark streaming Kafka maxRatePerPartition?

An important one is spark.streaming.kafka.maxRatePerPartition, which is the maximum rate (in messages per second) at which each Kafka partition will be read by the direct API.
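For instance, a sketch of setting this limit when building the streaming context (the app name, rate, and batch interval here are arbitrary examples):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap each Kafka partition at 1000 messages/second for the direct stream.
// With 5-second batches and 4 partitions that is at most 20,000 records per batch.
val conf = new SparkConf()
  .setAppName("RateLimitedStream") // placeholder app name
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(5))
```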


1 Answer

(i) Default parallelism is the number of cores (or 8 for Mesos), but the number of partitions is up to the input stream implementation.
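A quick way to see the difference, as a sketch rather than anything from the original answer (the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ParallelismCheck"))

// Reflects spark.default.parallelism (total executor cores unless overridden)
println(sc.defaultParallelism)

// parallelize uses that default, but a Kafka direct stream ignores it:
// its partition count comes from the topic's partition count instead
println(sc.parallelize(1 to 1000).partitions.length)
```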

(ii) No, the mapping of partition indexes to worker nodes is not deterministic. If you're running Kafka on the same nodes as your Spark executors, the preferred location for a task is the node of the Kafka leader for that partition. But even then, a task may be scheduled on another node.
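One way to observe this, sketched here under the assumption that `stream` is the direct stream from the question's setup:

```scala
import java.net.InetAddress
import org.apache.spark.TaskContext

// Log which host processes each partition; across batches the same partition
// index can show up on different hosts, since placement is only a preference
stream.foreachRDD { rdd =>
  rdd.foreachPartition { _ =>
    println(s"partition ${TaskContext.get.partitionId} ran on " +
      InetAddress.getLocalHost.getHostName)
  }
}
```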

answered Oct 13 '22 by Cody Koeninger