Batches in Spark Streaming are batches of RDDs. Suppose a batch contains 3 RDDs.
The Spark documentation also says that the receiver creates a block every 200 ms, and that each block is assigned a partition.
So in 1 second I would have a batch of 3 RDDs with 5 blocks, given the 200 ms block interval.
How does an RDD then get partitioned across the worker nodes? Is it a single RDD that is partitioned, or the complete batch?
I may have misunderstood this. Please guide me.
One streaming batch corresponds to one RDD. That RDD will have n partitions, where n = batch interval / block interval. With the standard 200 ms block interval and a batch interval of 2 seconds, you will have 10 partitions. Blocks are created by a receiver, and each receiver runs on a single host. So those 10 partitions all sit on one node and are replicated to a second node.
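To make the arithmetic concrete, here is a minimal sketch in plain Python (no Spark needed). The function name `partitions_per_batch` is hypothetical and only illustrates the formula above; it assumes a single receiver with data arriving continuously, so treat the result as an upper bound for quiet streams.

```python
def partitions_per_batch(batch_interval_ms: int, block_interval_ms: int = 200) -> int:
    """Estimate how many blocks (and so partitions) one receiver contributes to a batch.

    Hypothetical helper: it only applies n = batch interval / block interval.
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    # Each full block interval inside the batch yields one block, i.e. one partition.
    return batch_interval_ms // block_interval_ms


# 2-second batch with the 200 ms block interval from the answer.
print(partitions_per_batch(2000))
# 1-second batch from the question.
print(partitions_per_batch(1000))
```

So the 1-second batch from the question is a single RDD with about 5 partitions, not 3 RDDs.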
When the RDD is submitted for processing, the hosts running the tasks will read the data from the node that holds the blocks. Tasks executing on that same node will have "NODE_LOCAL" locality, while tasks executing on other nodes will have "ANY" locality and will take longer.
Therefore, to improve parallel processing, it's recommended to allocate several receivers and use union to create a single DStream for further processing. That way data will be consumed and processed by several nodes in parallel.
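The several-receivers-plus-union pattern could look roughly like this in PySpark. This is a sketch, not a definitive setup: the helper name `union_receiver_streams`, the socket sources, and the host/port values are assumptions for illustration, and it relies on the DStream API's `socketTextStream` and `StreamingContext.union`.

```python
def union_receiver_streams(ssc, endpoints):
    """Create one receiver-based DStream per (host, port) and union them.

    `ssc` is expected to be a pyspark.streaming.StreamingContext.
    `endpoints` is a hypothetical list of (host, port) socket sources.
    """
    # Each socketTextStream call starts its own receiver, which Spark can
    # place on a different executor.
    streams = [ssc.socketTextStream(host, port) for host, port in endpoints]
    if not streams:
        raise ValueError("at least one endpoint is required")
    # Merge the per-receiver streams into a single DStream for processing.
    return ssc.union(*streams)


# Usage sketch (assumes pyspark is installed and the sockets are being served):
#   from pyspark import SparkContext
#   from pyspark.streaming import StreamingContext
#   sc = SparkContext("local[4]", "UnionReceivers")  # more cores than receivers
#   ssc = StreamingContext(sc, 2)                    # 2-second batch interval
#   merged = union_receiver_streams(ssc, [("localhost", 9998), ("localhost", 9999)])
#   merged.count().pprint()
#   ssc.start()
#   ssc.awaitTermination()
```

Each receiver occupies one core, so the application needs more cores than receivers, otherwise no cores are left to process the batches.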