I am writing Kafka Consumer for high volume high velocity distributed application. I have only one topic but rate incoming messages is very high. Having multiple partition that serve more consumer would be appropriate for this use-case. Best way to consume is to have multiple stream readers. As per the documentation or available samples, number of KafkaStreams the ConsumerConnector gives out is based on number of topics. Wondering how to get more than one KafkaStream readers [based on the partition], so that I can span one thread per stream or Reading from same KafkaStream in multiple threads would do the concurrent read from multiple partitions?
Any insights are much appreciated.
Would like to share what I found from mailing list:
The number that you pass in the topic map controls how many streams a topic is divided into. In your case, if you pass in 1, all 10 partitions's data will be fed into 1 stream. If you pass in 2, each of the 2 streams will get data from 5 partitions. If you pass in 11, 10 of them will each get data from 1 partition and 1 stream will get nothing.
Typically, you need to iterate each stream in its own thread. This is because each stream can block forever if there is no new event.
Sample snippet:
topicCount.put(msgTopic, new Integer(partitionCount));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = connector.createMessageStreams(topicCount);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(msgTopic);
for (final KafkaStream stream : streams) {
    ReadTask task = new ReadTask(stream, msgTopic);
    task.addObserver(this.msgObserver);
    tasks.add(task); executor.submit(task);
}
Reference: http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201201.mbox/%3CCA+sHyy_Z903dOmnjp7_yYR_aE2sRW-x7XpAnqkmWaP66GOqf6w@mail.gmail.com%3E
The recommended way to do this is to have a thread pool so Java can handle organisation for you and for each stream the createMessageStreamsByFilter method gives you consume it in a Runnable. For example:
int NUMBER_OF_PARTITIONS = 6;
Properties consumerConfig = new Properties();
consumerConfig.put("zk.connect", "zookeeper.mydomain.com:2181" );
consumerConfig.put("backoff.increment.ms", "100");
consumerConfig.put("autooffset.reset", "largest");
consumerConfig.put("groupid", "java-consumer-example");
consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerConfig));
TopicFilter sourceTopicFilter = new Whitelist("mytopic|myothertopic");
List<KafkaStream<Message>> streams = consumer.createMessageStreamsByFilter(sourceTopicFilter, NUMBER_OF_PARTITIONS);
ExecutorService executor = Executors.newFixedThreadPool(streams.size());
for(final KafkaStream<Message> stream: streams){
    executor.submit(new Runnable() {
        public void run() {
            for (MessageAndMetadata<Message> msgAndMetadata: stream) {
                ByteBuffer buffer = msgAndMetadata.message().payload();
                byte [] bytes = new byte[buffer.remaining()];
                buffer.get(bytes);
                //Do something with the bytes you just got off Kafka.
            }
        }
    });
}
In this example I asked for 6 threads basically because I know that I have 3 partitions for each topic and I listed two topics in my whitelist. Once we have the handles of the incoming streams we can iterate over their content, which are MessageAndMetadata objects. Metadata is really just the topic name and offset. As you discovered you can do it in a single thread if you ask for 1 stream instead of, in my example 6, but if you require parallel processing the nice way is to launch an executor with one thread for each returned stream.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With