I am trying to figure out the best way to fan out my data into separate placeholders to be consumed by other processes.
Use case: I am receiving ticker data for several scripts (~2000 stocks) in a Kafka topic. I want to be able to run KPIs on all scripts individually (KPIs are like formulas that are applied to input data to get KPI values).
Options that I can think of:
Fan out to topics based on the script name: everything received on the source topic is sent to a separate topic named after the script. The problem here is that this creates a huge number of topics, and managing them and keeping track of everything becomes a tedious task.
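For illustration, here is a minimal sketch of such a fanout job using the plain Java kafka-clients API. The broker address, the source topic name `ticks`, the derived topic naming scheme, and the assumption that the record key carries the script name are all placeholders, not part of my actual setup:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TickFanout {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "tick-fanout");
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("ticks")); // assumed common source topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Route each tick to a per-script topic, e.g. "ticks.AAPL",
                    // assuming the record key carries the script name.
                    producer.send(new ProducerRecord<>(
                            "ticks." + record.key(), record.key(), record.value()));
                }
            }
        }
    }
}
```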

Keep all tick data in a single topic and partition it by script name using a CustomPartitioner: this keeps the topic count low and the system easy to manage, but every consumer has to discard a huge number of records to get to its own chunk of data, which adds latency. (In other words, a job that runs KPIs on Apple ticks would have to subscribe to the common topic and discard the ticks of every other script.)
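A custom partitioner along these lines might look like the sketch below; it simply hashes the record key (the script name) much as the default partitioner does, so treat it as a starting point rather than a real implementation. It would be registered via the producer's `partitioner.class` config:

```java
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class ScriptNamePartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Hash the record key (assumed to be the script name) so that every
        // tick for one script always lands in the same partition.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}
```

Note that hashing the key is what the default partitioner already does; a custom one only earns its keep once you start mapping specific heavy scripts to dedicated partitions.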

Please let me know if there is a better way to do it, and if not, which one to choose. Another important consideration: each KPI will write its results back to Kafka topics, to be further consumed by a rule engine.
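For concreteness, a rough sketch of what one such KPI job could look like if built with Kafka Streams; the topic names `ticks` and `kpi-values` and the `computeKpi` stub are hypothetical stand-ins for my actual formulas:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class KpiJob {
    // Placeholder for the real KPI formula applied per script.
    private static String computeKpi(String previous, String tick) {
        return tick; // trivial stand-in: "latest value"
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kpi-job");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> ticks = builder.stream("ticks"); // assumed source topic

        // Group by the record key (the script name) and fold each tick into a
        // per-script KPI value; results go to a topic the rule engine can consume.
        ticks.groupByKey()
             .aggregate(() -> "", (script, tick, kpi) -> computeKpi(kpi, tick))
             .toStream()
             .to("kpi-values"); // assumed output topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```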
As I see it:
1. Fan out to topics based on the script name
PROS
- Each KPI job subscribes only to the topic of its own script, so it never has to discard other scripts' records.
- Each topic can be configured independently (retention time, number of partitions, ...) to match its source.
CONS
- With ~2000 scripts you end up with a huge number of topics, and managing them and keeping track of everything becomes a tedious task.
2. Keep all tick data in a single topic and partition it by script name using CustomPartitioner
PROS
- A low topic count keeps the system easier to maintain from the hardware/resources point of view.
CONS
It may be easier to maintain, but it is much harder to develop correctly:
- Your partitioner must be updated continuously (if a new source comes in, if the number of partitions grows, ...), and this may become a tedious manual task.
- Forget about managing sources differently: all your incoming data, regardless of the source, will share the same topic parameters, for example retention time. You will have no option to persist some sources longer than others, or to (easily) spread one of them over a bigger number of partitions.
- Smaller, lighter sources will be affected by bigger sources, as all the data is processed in the same topic. If you launch consumer groups, the "small"-source consumers will have to discard many more messages in order to reach their own. If, on the other hand, you don't launch consumer groups and assign consumers by hand, you'll need to manually increase the number of partitions of the topic, assigning the new ones to the big sources; this will involve constant changes to your partitioner and to the assignment of your consumers (a sketch of such a hand-assigned consumer follows this list).
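To make that concrete, here is a sketch of a hand-assigned consumer for a single script. The topic name `ticks`, the partition count, and the murmur2 keying that mirrors Kafka's default partitioner are assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.utils.Utils;

public class SingleScriptKpiConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        String script = "AAPL";  // hypothetical script name
        int numPartitions = 50;  // must be kept in sync with the topic's real partition count

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Reproduce the partitioner's hash so we read only the one partition
            // that can contain this script's ticks.
            int partition = Utils.toPositive(
                    Utils.murmur2(script.getBytes(StandardCharsets.UTF_8))) % numPartitions;
            consumer.assign(List.of(new TopicPartition("ticks", partition)));

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if (!script.equals(record.key())) {
                        continue; // other scripts share this partition and must be discarded
                    }
                    // apply the KPI formula to record.value() here
                }
            }
        }
    }
}
```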
Anyway, if you have control of the source scripts, you could get rid of the second topic(s) altogether: apply the same logic to the source topic itself and avoid moving the data (which, I believe, is not transformed, just moved from one place to another) from the source topic to the end topics. This is most visible in the 2nd approach: why not partition the first topic directly?
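A minimal sketch of that idea, assuming the source scripts can produce keyed records directly (the topic name and payload below are made up):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TickSource {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying each record by script name means the partitioner (default or
            // custom) groups a script's ticks at produce time, so no separate
            // fanout/mover job between topics is needed.
            producer.send(new ProducerRecord<>("ticks", "AAPL", "{\"price\":191.45}"));
        }
    }
}
```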
Hope it helps; some of these points are totally subjective. :)