I am using Spark 1.5.2. I need to run a Spark Streaming job with Kafka as the streaming source, reading from multiple topics within Kafka and processing each topic differently.
You can't have multiple consumers that belong to the same group in one thread and you can't have multiple threads safely use the same consumer. One consumer per thread is the rule. To run multiple consumers in the same group in one application, you will need to run each in its own thread.
Multi-Topic Consumers: we may have a consumer group that listens to multiple topics. If the topics share the same key-partitioning scheme and the same number of partitions, we can join data across them.
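To make that concrete, here is a minimal sketch using the standalone Kafka consumer API (kafka-clients 0.9+, independent of Spark); the broker address, group id, and topic names are placeholders:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerThread implements Runnable {
        public void run() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("group.id", "my-group");              // placeholder group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            // the consumer lives entirely inside this thread and is never shared
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Arrays.asList("topicA", "topicB")); // one group, multiple topics
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("%s -> %s%n", r.topic(), r.value());
                    }
                }
            }
        }

        public static void main(String[] args) {
            // one consumer per thread: to scale, start more threads, never share a consumer
            new Thread(new ConsumerThread()).start();
            new Thread(new ConsumerThread()).start();
        }
    }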
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Kafka acts as the central hub for real-time streams of data, which are processed in Spark Streaming using complex algorithms. Once the data is processed, Spark Streaming can publish the results to yet another Kafka topic, or store them in HDFS, databases, or dashboards.
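A sketch of that pipeline using Spark 1.5.x's direct Kafka API (the spark-streaming-kafka artifact), with results published back to a second topic via the plain Kafka producer. The broker address and the "events"/"results" topic names are placeholders, and note that Spark 1.5's foreachRDD expects a Function returning Void:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import kafka.serializer.StringDecoder;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import scala.Tuple2;

    public class KafkaHubPipeline {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("kafka-hub-demo");
            // batch interval: incoming data is cut into 5-second micro-batches
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder broker

            JavaPairInputDStream<String, String> in = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("events")); // placeholder input topic

            // process each micro-batch, then publish the results to another topic;
            // a producer is created per partition here for simplicity
            in.mapValues(v -> v.toLowerCase()).foreachRDD(rdd -> {
                rdd.foreachPartition(it -> {
                    Properties p = new Properties();
                    p.put("bootstrap.servers", "broker1:9092");
                    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                    KafkaProducer<String, String> producer = new KafkaProducer<>(p);
                    while (it.hasNext()) {
                        Tuple2<String, String> rec = it.next();
                        producer.send(new ProducerRecord<>("results", rec._1(), rec._2()));
                    }
                    producer.close();
                });
                return null; // Spark 1.5's foreachRDD takes Function<..., Void>
            });

            jssc.start();
            jssc.awaitTermination();
        }
    }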
I made the following observations, in case it's helpful for someone:
Creating multiple streams would help in two ways:

1. You don't need to apply a filter operation to process different topics differently.
2. You can read multiple streams in parallel (as opposed to one by one with a single stream). To do so, there is an undocumented config parameter spark.streaming.concurrentJobs.

So, I decided to create multiple streams and set:
sparkConf.set("spark.streaming.concurrentJobs", "4");
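A minimal sketch of that multi-stream setup, again assuming the Spark 1.5.x direct Kafka API, with placeholder broker and topic names:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class MultiStreamJob {
        public static void main(String[] args) throws Exception {
            SparkConf sparkConf = new SparkConf().setAppName("multi-topic-demo");
            // let the independent output operations below run concurrently
            sparkConf.set("spark.streaming.concurrentJobs", "4");
            JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder broker

            // one direct stream per topic, so each topic gets its own processing logic
            JavaPairInputDStream<String, String> streamA = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("topicA"));
            JavaPairInputDStream<String, String> streamB = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("topicB"));

            streamA.mapValues(v -> v.toUpperCase()).print(); // processing for topicA
            streamB.filter(t -> !t._2().isEmpty()).print();  // processing for topicB

            jssc.start();
            jssc.awaitTermination();
        }
    }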
I think the right solution depends on your use case.
If your processing logic is the same for data from all topics, then, without doubt, a single stream is the better approach.
If the processing logic is different, I guess you still get a single RDD containing data from all the topics, and you have to create a paired RDD for each processing logic and handle it separately. The problem is that this effectively groups the processing, and the overall processing speed is determined by the topic that needs the longest time to process: topics with less data have to wait until data from all topics is processed. One advantage is that if it is time-series data, the processing proceeds together, which might be a good thing.
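A sketch of that single-stream variant: one way to recover each record's topic with the Spark 1.5 direct API is the HasOffsetRanges pattern from the Spark-Kafka integration guide, since the RDD partitions of a direct stream map one-to-one onto Kafka offset ranges. Topic names are placeholders:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.HasOffsetRanges;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import org.apache.spark.streaming.kafka.OffsetRange;
    import scala.Tuple2;

    public class SingleStreamJob {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("single-stream-demo");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder broker

            // one stream over all topics
            JavaPairInputDStream<String, String> raw = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, new HashSet<>(Arrays.asList("topicA", "topicB")));

            // tag each record with its source topic: the partition index
            // indexes into the offset ranges of the underlying KafkaRDD
            JavaPairDStream<String, String> tagged = raw.transformToPair(rdd -> {
                OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                return rdd.mapPartitionsWithIndex((idx, it) -> {
                    List<Tuple2<String, String>> out = new ArrayList<>();
                    while (it.hasNext()) {
                        out.add(new Tuple2<>(ranges[idx].topic(), it.next()._2()));
                    }
                    return out.iterator();
                }, true).mapToPair(t -> t);
            });

            // one filtered branch per processing logic; the batch finishes
            // only when the slowest branch finishes
            tagged.filter(t -> t._1().equals("topicA")).map(t -> t._2().toUpperCase()).print();
            tagged.filter(t -> t._1().equals("topicB")).map(t -> t._2().toLowerCase()).print();

            jssc.start();
            jssc.awaitTermination();
        }
    }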
Another advantage of running independent jobs is that you get better control and can tune resource sharing. For example, a job that processes a high-throughput topic can be allocated more CPU/memory.
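For instance, when the topics are handled by separate applications, each spark-submit invocation can carry its own resource flags (the class names and jar files below are placeholders):

    spark-submit --class com.example.TopicAJob --executor-memory 8g --executor-cores 4 topic-a-job.jar
    spark-submit --class com.example.TopicBJob --executor-memory 2g --executor-cores 1 topic-b-job.jar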