I am using Spark 1.5.2. I need to run a Spark Streaming job with Kafka as the streaming source, reading from multiple topics within Kafka and processing each topic differently.
You can't have multiple consumers that belong to the same group in one thread and you can't have multiple threads safely use the same consumer. One consumer per thread is the rule. To run multiple consumers in the same group in one application, you will need to run each in its own thread.
Multi-Topic Consumers: we may have a consumer group that listens to multiple topics. If the topics share the same key-partitioning scheme and the same number of partitions, we can join data across them.
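To make that concrete, here is a minimal sketch using the standalone Kafka consumer API (kafka-clients 0.9+, independent of Spark); the broker address, group id, and topic names are placeholders:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerThread implements Runnable {
        public void run() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("group.id", "my-group");              // placeholder group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            // the consumer lives entirely inside this thread and is never shared
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Arrays.asList("topicA", "topicB")); // one group, multiple topics
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("%s -> %s%n", r.topic(), r.value());
                    }
                }
            }
        }

        public static void main(String[] args) {
            // one consumer per thread: to scale, start more threads, never share a consumer
            new Thread(new ConsumerThread()).start();
            new Thread(new ConsumerThread()).start();
        }
    }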
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Kafka acts as the central hub for real-time streams of data, which are processed in Spark Streaming using complex algorithms. Once the data is processed, Spark Streaming can publish the results to yet another Kafka topic, or store them in HDFS, databases, or dashboards.
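A sketch of that pipeline using Spark 1.5.x's direct Kafka API (the spark-streaming-kafka artifact), with results published back to a second topic via the plain Kafka producer. The broker address and the "events"/"results" topic names are placeholders, and note that Spark 1.5's foreachRDD expects a Function returning Void:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import kafka.serializer.StringDecoder;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import scala.Tuple2;

    public class KafkaHubPipeline {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("kafka-hub-demo");
            // batch interval: incoming data is cut into 5-second micro-batches
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder broker

            JavaPairInputDStream<String, String> in = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("events")); // placeholder input topic

            // process each micro-batch, then publish the results to another topic;
            // a producer is created per partition here for simplicity
            in.mapValues(v -> v.toLowerCase()).foreachRDD(rdd -> {
                rdd.foreachPartition(it -> {
                    Properties p = new Properties();
                    p.put("bootstrap.servers", "broker1:9092");
                    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                    KafkaProducer<String, String> producer = new KafkaProducer<>(p);
                    while (it.hasNext()) {
                        Tuple2<String, String> rec = it.next();
                        producer.send(new ProducerRecord<>("results", rec._1(), rec._2()));
                    }
                    producer.close();
                });
                return null; // Spark 1.5's foreachRDD takes Function<..., Void>
            });

            jssc.start();
            jssc.awaitTermination();
        }
    }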
I made the following observations, in case it's helpful for someone:
Creating multiple streams would help in two ways:

1. You don't need to apply a filter operation to process different topics differently.
2. You can read multiple streams in parallel (as opposed to one by one with a single stream). To do so, there is an undocumented config parameter spark.streaming.concurrentJobs.

So, I decided to create multiple streams and set:
sparkConf.set("spark.streaming.concurrentJobs", "4");
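A minimal sketch of that multi-stream setup, again assuming the Spark 1.5.x direct Kafka API, with placeholder broker and topic names:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class MultiStreamJob {
        public static void main(String[] args) throws Exception {
            SparkConf sparkConf = new SparkConf().setAppName("multi-topic-demo");
            // let the independent output operations below run concurrently
            sparkConf.set("spark.streaming.concurrentJobs", "4");
            JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder broker

            // one direct stream per topic, so each topic gets its own processing logic
            JavaPairInputDStream<String, String> streamA = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("topicA"));
            JavaPairInputDStream<String, String> streamB = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("topicB"));

            streamA.mapValues(v -> v.toUpperCase()).print(); // processing for topicA
            streamB.filter(t -> !t._2().isEmpty()).print();  // processing for topicB

            jssc.start();
            jssc.awaitTermination();
        }
    }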
I think the right solution depends on your use case.
If your processing logic is the same for data from all topics, then, without doubt, a single stream is the better approach.
If the processing logic is different, I guess you still get a single RDD containing data from all the topics, and you have to create a paired RDD for each processing logic and handle it separately. The problem is that this effectively groups the processing, and the overall processing speed is determined by the topic that needs the longest time to process: topics with less data have to wait until data from all topics is processed. One advantage is that if it is time-series data, the processing proceeds together, which might be a good thing.
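A sketch of that single-stream variant: one way to recover each record's topic with the Spark 1.5 direct API is the HasOffsetRanges pattern from the Spark-Kafka integration guide, since the RDD partitions of a direct stream map one-to-one onto Kafka offset ranges. Topic names are placeholders:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.HasOffsetRanges;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import org.apache.spark.streaming.kafka.OffsetRange;
    import scala.Tuple2;

    public class SingleStreamJob {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("single-stream-demo");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder broker

            // one stream over all topics
            JavaPairInputDStream<String, String> raw = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, new HashSet<>(Arrays.asList("topicA", "topicB")));

            // tag each record with its source topic: the partition index
            // indexes into the offset ranges of the underlying KafkaRDD
            JavaPairDStream<String, String> tagged = raw.transformToPair(rdd -> {
                OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                return rdd.mapPartitionsWithIndex((idx, it) -> {
                    List<Tuple2<String, String>> out = new ArrayList<>();
                    while (it.hasNext()) {
                        out.add(new Tuple2<>(ranges[idx].topic(), it.next()._2()));
                    }
                    return out.iterator();
                }, true).mapToPair(t -> t);
            });

            // one filtered branch per processing logic; the batch finishes
            // only when the slowest branch finishes
            tagged.filter(t -> t._1().equals("topicA")).map(t -> t._2().toUpperCase()).print();
            tagged.filter(t -> t._1().equals("topicB")).map(t -> t._2().toLowerCase()).print();

            jssc.start();
            jssc.awaitTermination();
        }
    }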
Another advantage of running independent jobs is that you get better control and can tune resource sharing. For example, a job that processes a high-throughput topic can be allocated more CPU/memory.
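For instance, when the topics are handled by separate applications, each spark-submit invocation can carry its own resource flags (the class names and jar files below are placeholders):

    spark-submit --class com.example.TopicAJob --executor-memory 8g --executor-cores 4 topic-a-job.jar
    spark-submit --class com.example.TopicBJob --executor-memory 2g --executor-cores 1 topic-b-job.jar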