
Apache Kafka for Time Series Data Persistence

Tags:

apache-kafka

We have a system (say System A) that receives time series data via HTTP and persists it in OpenTSDB through OpenTSDB's REST interface. I would now like to introduce Apache Kafka into the system. The idea would be to have a Kafka server running, where System A, as soon as it receives a time series message, publishes that message to the Kafka server.

I can then have a consumer that reads from the topic and writes this data to OpenTSDB. I have a couple of questions about this approach:

With respect to architecting the Producer and Consumer:

  1. Can I have a standalone client where I will write consumers that just consume from the Kafka topic and write the messages into OpenTSDB?

  2. The producers will be part of System A and will publish messages to the respective topic.

With respect to Kafka topics, the time series data consists of metrics that have a key and a value, an example of which is below:

 "metric.metricType.tagName"

I will be having hundreds or perhaps even thousands of these different tagNames. How do I structure this information and represent it as topics in Apache Kafka? I'm not sure if there is a limit on the number of topics that I could create.

Should I have one topic per tagName? What is the deal with partitioning the topic?

With respect to Apache Kafka partitioning, I have the following questions:

  1. If I have a topic "Topic A" and have set partitions to 4 for this topic, and if my producer writes to this topic, in which partition of this topic will this message be available? Is the same message available across each partition within the same topic?

  2. If I write a consumer for this partitioned topic, how will this behave? I mean, will this consumer receive messages from all the partitions?

  3. If I have multiple consumers for this partitioned topic, will all of those consumers get the same messages? I mean, if there are 4 partitions in the topic (TP1, TP2, TP3, TP4) and I have 4 consumer groups (CG1, CG2, CG3, CG4) where each consumer group has one consumer that reads the messages from the respective topic partition (C1 reads from TP1, C2 reads from TP2, and so on), will I end up with duplicate messages if all my consumer groups write the messages they receive to the same database?

asked Jan 11 '16 by joesan




1 Answer

Can I have a standalone client where I will write consumers that just consume from the Kafka topic and write the messages into OpenTSDB?

Yes, that's how I did it: a standalone Java app (you might call it a "Java server app").
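
A minimal sketch of such a standalone consumer, assuming a topic named "metrics", a broker at localhost:9092, message values that already arrive as OpenTSDB-style JSON datapoints, and OpenTSDB's REST endpoint at http://localhost:4242/api/put (the topic name, group id and addresses are placeholders, not something the question specifies):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.Properties;

    public class MetricsToOpenTsdb {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");    // placeholder broker address
            props.put("group.id", "opentsdb-writer");             // placeholder group name
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("metrics")); // placeholder topic name

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // the value is assumed to already be an OpenTSDB-style JSON datapoint
                    postToOpenTsdb(record.value());
                }
            }
        }

        // POST a single datapoint to OpenTSDB's REST API (/api/put)
        private static void postToOpenTsdb(String jsonDatapoint) throws Exception {
            URL url = new URL("http://localhost:4242/api/put"); // placeholder OpenTSDB address
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(jsonDatapoint.getBytes(StandardCharsets.UTF_8));
            }
            conn.getResponseCode(); // force the request to execute; response body ignored here
            conn.disconnect();
        }
    }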

Should I have one topic per tagName?

If you'll want to treat messages with one tag differently than messages with other tags, e.g. different retention, message size (and other topic-level settings), then it makes sense to have a separate topic, but if you're going to have thousands of tags I'd rather not do that. The tag can be just a simple field inside a message. You can have one topic that will be used for your metrics and then, when you want to add other types of messages (and you'll surely want to do that once you see the benefit :), you can create a different topic for those. You can roughly look at topics as entities in a database, but that's a rather weak comparison, since the right layout depends on many factors, like size, incoming rate and similar things. There's no one-size-fits-all recipe, so you'd have to ask a separate, specific question with all the parameters you have.
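
For instance, a small producer sketch where the tag travels inside the message payload and everything goes to a single "metrics" topic (the topic name, broker address and JSON layout are assumptions for illustration, loosely mirroring OpenTSDB's /api/put datapoint format):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class MetricPublisher {

        private final Producer<String, String> producer;

        public MetricPublisher() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            this.producer = new KafkaProducer<>(props);
        }

        // The tagName is just a field in the message body, so a single
        // "metrics" topic can carry thousands of different tags.
        public void publish(String metric, String tagName, double value, long timestamp) {
            String json = String.format(
                    "{\"metric\":\"%s\",\"tags\":{\"tagName\":\"%s\"},\"timestamp\":%d,\"value\":%f}",
                    metric, tagName, timestamp, value);
            producer.send(new ProducerRecord<>("metrics", json));
        }
    }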

What is the deal with partitioning the topic?

Partitions are Kafka's consumption parallelism mechanism (they also facilitate redundancy, since each partition is replicated across brokers, depending on the replication factor you choose). Since a partition cannot be consumed by more than one consumer thread within the same consumer group, you'll want to create more partitions initially (and start consuming with a smaller number of threads), so that you can later increase the number of threads up to the number of partitions. (This restriction might have been lifted in the latest Kafka version, 0.9; the rule applies to the low-level consumer of v0.8.)
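
A rough sketch of that parallelism using the newer (0.9+) consumer API, assuming the "metrics" topic from above was created with 4 partitions (topic and group names are placeholders): one consumer thread per partition, all in the same group, so Kafka assigns each partition to exactly one thread, and a fifth thread added to this group would simply sit idle.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelMetricsConsumer {

        public static void main(String[] args) {
            int partitions = 4; // assumption: the topic was created with 4 partitions
            ExecutorService pool = Executors.newFixedThreadPool(partitions);

            for (int i = 0; i < partitions; i++) {
                pool.submit(() -> {
                    Properties props = new Properties();
                    props.put("bootstrap.servers", "localhost:9092");
                    props.put("group.id", "opentsdb-writer"); // same group => partitions are split
                    props.put("key.deserializer",
                            "org.apache.kafka.common.serialization.StringDeserializer");
                    props.put("value.deserializer",
                            "org.apache.kafka.common.serialization.StringDeserializer");

                    // Each thread owns its own KafkaConsumer (the consumer is not thread-safe)
                    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                        consumer.subscribe(Collections.singletonList("metrics"));
                        while (true) {
                            ConsumerRecords<String, String> records = consumer.poll(1000);
                            for (ConsumerRecord<String, String> record : records) {
                                System.out.printf("partition %d -> %s%n",
                                        record.partition(), record.value());
                            }
                        }
                    }
                });
            }
        }
    }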

If I have a topic "Topic A" and have set partitions to 4 for this topic, and if my producer writes to this topic, in which partition of this topic will this message be available?

If you publish messages like you described, you won't know in which partition a message will end up. This is determined by hashing on the producer side, and the default mechanism is effectively random (something like round-robin). You can control partitioning by choosing the attribute(s) that are used for hashing: e.g. if you use your tag as the message key, all messages with the same tag will always go to the same partition. This is important when you want to ensure that messages with the same tag are consumed in the same order they were put into Kafka, i.e. produced.
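
A sketch of that keyed-producer idea (topic name, broker address and payload are again placeholders): by passing the tagName as the record key, the default partitioner hashes the key and every message with that tag ends up in the same partition.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class KeyedMetricPublisher {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                String tagName = "metric.metricType.tagName"; // example tag format from the question
                String payload = "{\"timestamp\":1452492000,\"value\":42.0}";

                // Key = tagName: the default partitioner hashes the key, so every
                // message with this tag goes to the same partition and is consumed
                // in the order it was produced.
                producer.send(new ProducerRecord<>("metrics", tagName, payload));
            }
        }
    }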

Is the same message available across each partition within the same topic?

No, each partition holds its own, disjoint subset of the topic's messages (roughly equal in size if the default, random hashing is used); a given message is written to exactly one partition.

If I write a consumer for this partitioned topic, how will this behave? I mean, will this consumer receive messages from all the partitions?

Messages will be consumed in no particular overall order, since there's no coordination between consumer threads; understandably so, since that coordination would incur a huge performance penalty. Ordering is only guaranteed within a single partition.

If I have multiple consumers for this partitioned topic, will all of those consumers get the same messages?

This depends on the consumer group. All consumer threads that are in the same group together receive 100% of the messages (e.g. each of the 4 consumer threads would get roughly 25% of the messages from that topic). If, on the other hand, you have 2 consumers in different groups, each of them would consume 100% of the messages from that topic. I think you can deduce the answers to your last two questions from this, right?
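
To make that concrete (group names are placeholders), the only thing that changes between the two scenarios is the group.id each consumer is configured with:

    import java.util.Properties;

    public class ConsumerGroupConfigs {

        public static void main(String[] args) {
            // Two consumers with the SAME group.id split the topic's partitions:
            // together they see 100% of the messages, each sees only a part,
            // and nothing is delivered twice within the group.
            Properties writerA = new Properties();
            writerA.put("group.id", "opentsdb-writer");

            Properties writerB = new Properties();
            writerB.put("group.id", "opentsdb-writer");

            // A consumer with a DIFFERENT group.id gets its own full copy of the
            // stream, e.g. a separate component that must also see every metric.
            Properties other = new Properties();
            other.put("group.id", "some-other-component");
        }
    }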

answered Dec 24 '22 by Marko Bonaci