I am looking to set up Kafka as an intermediary between data coming from IoT machines and a service that will process that data. I am having trouble identifying the proper way to design my topics for my use case and would love some advice.
I am looking to read sensor data from many machines, and each machine can have many sensors (e.g. temperature, pressure, parts, etc.). The order of the messages that my consumers read is important and needs to be sequential.
I have come up with three possible designs, but I am not sure which, if any, is best:
a) Each machine writes to its own topics, each with 1 partition, to guarantee sequence. So machine 100 would write to topics called machine100TempSensor1, machine100TempSensor2, machine100PressureSensor1, etc.
b) All machines write to a single topic per sensor type, keyed on the machine and sensor. Using the same example as above, machine 100 would write to a topic called 'temperature', with a key identifying the machine and sensor (see the producer sketch after this list).
e.g.
(Topic: temperature, key: machine100TempSensor1)
(Topic: temperature, key: machine100TempSensor2)
(Topic: temperature, key: machine200TempSensor1)
c) Produce all temperature-related messages to a single 'temperature' topic and filter the messages as I process the data.
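For option (b), here is a minimal producer sketch in Java (the topic name 'temperature', the key format, and the 'localhost:9092' broker address are illustrative assumptions). Keying on machine + sensor means all messages for one sensor hash to the same partition, which preserves their order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TemperatureProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always hash to the same partition,
            // so per-sensor ordering is preserved inside the 'temperature' topic.
            String key = "machine100TempSensor1"; // hypothetical machine+sensor id
            producer.send(new ProducerRecord<>("temperature", key, "23.7"));
        }
    }
}
```

Note that the key only determines which partition a record lands in; partitions themselves are numbered, not named.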
My concerns with each solution:
a)
- Kafka guarantees ordering at the partition level only, so would creating a topic with a single partition be a good idea, or does that go against what a topic should be?
- If I wanted to read 'Temperature' from all machines, I would have to know the names and request data from specific topics instead of a general 'Temperature' topic.
- Kafka states that only one consumer group can read from a single partition, so I would have to create many consumer groups.
b)
- A single 'temperature' topic could have 30+ partitions, if not hundreds or thousands once I consider scaling (but I would have the benefit of reading all partitions at once).
- Since only a single consumer group is able to read from a single partition, I will have a consumer group for every consumer.
c)
- I feel there could be a big performance cost in filtering out thousands of irrelevant messages (see the filtering sketch after this list).
- I will run into the same issue when it comes time to push the processed data back to Kafka.
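To make the cost in (c) concrete, a filtering consumer would look roughly like this (a sketch; the group id, key format, and broker address are assumptions): every record has to be fetched and deserialized before it can be discarded.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "temperature-processor");   // hypothetical group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("temperature"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Every record is fetched and inspected, even the ones we drop.
                    if (record.key() != null && record.key().startsWith("machine100")) {
                        process(record.value());
                    }
                }
            }
        }
    }

    static void process(String value) { /* ... */ }
}
```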
Something to consider is that I would like the ability to process only certain machines/sensors.
Hopefully I have been able to explain everything clearly.
A Kafka cluster should have a maximum of 200,000 partitions across all brokers when managed by ZooKeeper, because if a broker goes down, ZooKeeper needs to perform a large number of leader elections. Confluent still recommends a maximum of 4,000 partitions per broker in your cluster.
But here are a few general rules:
- maximum 4,000 partitions per broker (in total, distributed over many topics)
- maximum 200,000 partitions per Kafka cluster (in total, distributed over many topics)
- resulting in a maximum of 50 brokers per Kafka cluster (200,000 / 4,000 = 50)
Replication is how Kafka provides redundancy: Kafka keeps more than one copy of the same partition across multiple brokers, and each redundant copy is called a replica. If a broker fails, Kafka can still serve consumers from the replicas of the partitions that the failed broker owned.
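As a sketch of how partition count and replication factor are set per topic with the Java AdminClient (the topic name and counts are illustrative):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 30 partitions for parallelism; 3 replicas of each partition for redundancy.
            NewTopic topic = new NewTopic("temperature", 30, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```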
Your overall understanding of Kafka is not 100% correct.
1) Kafka basically scales over partitions -- thus, for the brokers, there is no difference (from a performance perspective) between 1 topic with 1000 partitions and 1000 topics with 1 partition each. (If you plan to use Kafka Streams (aka the Streams API), a single topic with 1000 partitions would be better, because Kafka Streams does not scale very well across topics.)
2) Creating single-partition topics to guarantee ordering is basically fine. To subscribe to multiple topics at once, you can use pattern subscription if you name the topics accordingly (see the pattern-subscription sketch at the end of this answer).
3) A single broker can host several thousand partitions. Thus, even with replication taken into account, you don't need a huge cluster.
4) This claim sounds incorrect (or maybe I misunderstand it):
"Kafka states that only one consumer group can read from a single partition, so I would have to create many consumer groups."
Maybe you mean that only one consumer within a single consumer group can read a partition. That would be correct: within a consumer group, each partition is assigned (either manually or via the built-in consumer group management) to at most one consumer in the group. You only need multiple consumer groups if multiple applications want to read the same partition (see the manual-assignment sketch at the end of this answer).
5) Your concern about (c) seems legit.
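To illustrate point 2, here is a sketch of pattern subscription, assuming the topic naming scheme from option (a); the regex, group id, and broker address are illustrative:

```java
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PatternConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "temperature-readers");      // pattern subscription requires a group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Matches machine100TempSensor1, machine200TempSensor1, ...
            // but not the pressure topics.
            consumer.subscribe(Pattern.compile("machine\\d+Temp.*"));
            // ... poll loop as usual ...
        }
    }
}
```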
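And for point 4, a sketch of manually assigning a specific partition to a consumer, which works without any consumer group management (the topic name and partition number are illustrative):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manual assignment: this consumer reads partition 0 of 'temperature'
            // directly, independent of any group rebalancing.
            consumer.assign(Collections.singletonList(new TopicPartition("temperature", 0)));
            // ... poll loop as usual ...
        }
    }
}
```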