 

Disperse messages in Kafka stream

I am trying to figure out the best way to fan out my data into separate placeholders to be consumed by other processes.

Use case: I am receiving ticker data for several scripts (~2000 stocks) in a Kafka topic. I want to be able to run KPIs on each script individually (KPIs are formulas applied to the input data to produce KPI values).

Options that I can think of

  1. Fan out to topics based on the script name. Everything received on the source topic is sent to a separate topic per script name (a fan-out sketch follows this list). The problem here is that this creates a huge number of topics, and managing them and keeping track of everything becomes a tedious task.

    [Diagram: Disperse]

  2. Keep all tick data in a single topic, partitioned by script name using a custom partitioner (a partitioner sketch follows below). This keeps the topic count low and the system easy to manage, but every consumer has to discard a huge number of records to get its chunk of the data, which causes latency. (In other words, a job that runs a KPI on Apple ticks must subscribe to the common topic and discard the ticks from all other scripts.)

    [Diagram: data in a single topic, partitioned by script name]
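
For option 1, a rough sketch of what the fan-out step could look like in Java (the topic names "ticks" and "ticks.<script>", the group id, and string keys holding the script name are my assumptions, not from the question):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TickFanout {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put("bootstrap.servers", "localhost:9092");
            cProps.put("group.id", "tick-fanout"); // hypothetical group name
            cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties pProps = new Properties();
            pProps.put("bootstrap.servers", "localhost:9092");
            pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
                consumer.subscribe(List.of("ticks")); // assumed common source topic
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                        // Assumes every record is keyed by script name, e.g. "AAPL",
                        // and forwards it unchanged to that script's own topic.
                        producer.send(new ProducerRecord<>("ticks." + rec.key(), rec.key(), rec.value()));
                    }
                }
            }
        }
    }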

Please let me know if there is a better way to do this, and if not, which option to choose. Another important consideration: each KPI will write its results back to Kafka topics, to be consumed further by a rule engine.
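
For option 2, a sketch of what the custom partitioner could look like (the class name, the pinned scripts, and the reserved-partition scheme are all hypothetical): the busiest scripts get dedicated partitions so workers can find them deterministically, and everything else is hashed over the remaining ones.

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    public class ScriptPartitioner implements Partitioner {
        // Hypothetical pinning of the busiest scripts to fixed partitions.
        private static final Map<String, Integer> PINNED =
                Map.of("AAPL", 0, "MSFT", 1, "GOOG", 2);
        private static final int RESERVED = 3; // partitions 0..2 are reserved

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            Integer pinned = PINNED.get((String) key); // assumes string keys = script names
            if (pinned != null) return pinned;
            int total = cluster.partitionsForTopic(topic).size(); // assumes total > RESERVED
            // Hash all other scripts over the non-reserved partitions.
            return RESERVED + Utils.toPositive(Utils.murmur2(keyBytes)) % (total - RESERVED);
        }

        @Override public void close() {}
        @Override public void configure(Map<String, ?> configs) {}
    }

It would be registered on the producer with props.put("partitioner.class", ScriptPartitioner.class.getName()).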

asked Oct 27 '25 by Count

1 Answer

As I see it:

1. Fan out to topics based on the script name

PROS

  • Total control of individual placeholders. You can set different replication, persistence, and partition-count parameters for each of your sources (see the AdminClient sketch after this list).
  • Easy health checks. You can easily check whether one source (e.g., Citigroup) is getting more incoming data than the others, delete data from a specific source that is affecting the process, etc.
  • Independent scaling. Related to the previous point: with the second approach, you are tied to the number of partitions of a single topic in order to scale your consumers (and you may already have hit that maximum). With this first approach, you create different topics (and thus different partitions), which lets you launch consumers on exactly the topics that need them, or increase the partition count of an individual topic.
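
As a sketch of that per-source tuning, using the standard Kafka AdminClient (the topic names, partition counts, and retention values are made up for the example):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreatePerScriptTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                admin.createTopics(List.of(
                        // Busy source: more partitions, longer retention.
                        new NewTopic("ticks.AAPL", 6, (short) 3)
                                .configs(Map.of("retention.ms", "604800000")), // 7 days
                        // Quiet source: one partition, shorter retention.
                        new NewTopic("ticks.XYZ", 1, (short) 2)
                                .configs(Map.of("retention.ms", "86400000"))   // 1 day
                )).all().get();
            }
        }
    }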

CONS

  • A lot of topics, which could lead to a lot of replication, persistence, offset control, etc.: more work for the broker...
  • ... and for the maintainer. : )

2. Keeping all tick data in a single topic, partitioned by script name using a custom partitioner

PROS

  • Less broker/maintainer saturation. All the data lives in one topic, where you "only" have to control the partitioning.
  • Launching different consumer groups against a single topic is totally fine. You have two options here:
    • Launch different consumer groups across all partitions: each worker checks the key of every message and discards it if it isn't from its own source. This is really fast and shouldn't cause that much latency.
    • Forget about consumer groups: if you partition by source, you can point specific workers at specific partitions via assign() (see the sketch after this list). Even though all the data is in one topic, it is effectively divided across the partitions, so your consumers won't discard anything. Keep in mind that you should have only one consumer per partition: assign() doesn't enforce the exclusivity rules of subscribe(), so it is possible to assign the same partition to two different threads, and all the data in that partition would then be processed twice.
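
A minimal sketch of that assign() variant (the topic name and partition number are placeholders; the deserializer settings are assumptions):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class PinnedTickWorker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            // No group.id: group management is bypassed entirely with assign().
            props.put("enable.auto.commit", "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Partition 0 is where the partitioner put this worker's script.
                consumer.assign(List.of(new TopicPartition("ticks", 0)));
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                        // run the KPI on rec.value() here
                    }
                }
            }
        }
    }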

CONS

  • It may be easier to maintain (from the hardware/resources point of view), but much harder to develop correctly.

  • Your partitioner must be updated continuously (when a new source comes in, when the partition count grows, ...), and this may become a tedious manual task.

  • Forget about managing sources differently: all your incoming data, regardless of source, will share the same topic parameters, for example retention time. You will have no way to persist some sources longer than others, or to (easily) spread one source across a bigger number of partitions, etc.

  • Smaller, lighter sources will be affected by bigger ones, since all the data is processed in the same topic. If you use consumer groups, the small-source consumers will have to discard many more messages to reach their own. If you instead assign consumers by hand, you'll need to manually increase the number of partitions of the topic and assign the new ones to the big sources (see the sketch below), which means constant changes to your partitioner and to the assignment of your consumers.
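
That repartitioning chore looks roughly like this with the AdminClient (topic name and target count are placeholders). Note that Kafka only ever adds partitions; existing records are not rebalanced, so the partitioner and the hand-assigned consumers must be updated in step:

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class GrowTicksTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // Grow the shared "ticks" topic from its current count to 12.
                admin.createPartitions(Map.of("ticks", NewPartitions.increaseTo(12)))
                     .all().get();
            }
        }
    }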


In any case, if you have control of the source scripts, you could get rid of the second topic (or topics) altogether: you could build the same logic into the source topic and avoid moving data (which I believe is not transformed, just moved from one place to another) from the source topic to the end topics. This is most visible in the second approach: why not partition the first topic directly?

Hope it helps; some of these points are totally subjective. : )

answered Oct 28 '25 by aran

