Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Storm-Kafka multiple spouts, how to share the load?

I am trying to share the task among the multiple spouts. I have a situation, where I'm getting one tuple/message at a time from external source and I want to have multiple instances of a spout, main intention behind is to share the load and increase performance efficiency.

I can do the same with one Spout itself, but I want to share the load across multiple spouts. I am not able to get the logic to spread the load. Since the offset of messages will not be known until the particular spout finishes the consuming the part (i.e based on buffer size set).

Can anyone please put some bright light on the how to work-out on the logic/algorithm?

Advance Thanks for your time.


Update in response to answers:
Now used multi-partitions on Kafka (i.e 5)
Following is the code used:
builder.setSpout("spout", new KafkaSpout(cfg), 5);

Tested by flooding with 800 MB data on each partition and it took ~22 sec to finish read.

Again, used the code with parallelism_hint = 1
i.e. builder.setSpout("spout", new KafkaSpout(cfg), 1);

Now it took more ~23 sec! Why?

According to Storm Docs setSpout() declaration is as follows:

public SpoutDeclarer setSpout(java.lang.String id,
                              IRichSpout spout,
                              java.lang.Number parallelism_hint)

where,
parallelism_hint - is the number of tasks that should be assigned to execute this spout. Each task will run on a thread in a process somewhere around the cluster.

like image 781
Amol M Kulkarni Avatar asked Aug 16 '13 07:08

Amol M Kulkarni


1 Answers

I had come across a discussion in storm-user which discuss something similar.

Read Relationship between Spout parallelism and number of kafka partitions.


2 things to note while using kafka-spout for storm

  1. The maximum parallelism you can have on a KafkaSpout is the number of partitions.
  2. We can split the load into multiple kafka topics and have separate spout instances for each. ie. each spout handling a separate topic.

So if we have a case where kafka partitions per host is configured as 1 and the number of hosts is 2. Even if we set the spout parallelism as 10, the max value which is repected will only be 2 which is the number of partitions.


How To mention the number of partition in the Kafka-spout?

List<HostPort> hosts = new ArrayList<HostPort>();
hosts.add(new HostPort("localhost",9092));
SpoutConfig objConfig=new SpoutConfig(new KafkaConfig.StaticHosts(hosts, 4), "spoutCaliber", "/kafkastorm", "discovery");

As you can see, here brokers can be added using hosts.add and the partion number is specified as 4 in the new KafkaConfig.StaticHosts(hosts, 4) code snippet.


How To mention the parallelism hint in the Kafka-spout?

builder.setSpout("spout", spout,4);

You can mention the same while adding your spout into the topology using setSpout method. Here 4 is the parallelism hint.


More links that might help

Understanding-the-parallelism-of-a-Storm-topology

what-is-the-task-in-twitter-storm-parallelism


Disclaimer: !! i am new to both storm and java !!!! So pls edit/add if its required some where.

like image 139
Mithun Satheesh Avatar answered Sep 19 '22 05:09

Mithun Satheesh