I want to dive into stream processing with Kafka and I need some help to get my head around some design principles which are currently not very clear to me.
1.) Assuming I have some real-time stock price data: would you make one topic "price" keyed (and therefore partitioned) by the stock symbol, or would you make one topic per symbol? For example, what happens if I decide to produce (add) some more stock symbols, including a full history, later on? Now my history (ordering in the log) in the topic "price" is a mess, right? On the other hand, for each price series I want to calculate the returns later on, and if they are on different topics I have to keep track of them and start new stream applications for every single symbol.
2.) Now I have different real-time prices and I need to join an arbitrary number of them into one big record, for example join all sp500 symbols into one record. Since I do not have a price for all sp500 symbols at the very same time (but maybe pretty close), how can I join them so that the latest price is always used when one is missing at that exact time?
3.) Say I have solved the join use case and I pump the joined records of all sp500 stocks back into Kafka. What do I need to do if I have made a mistake and forgot one symbol? Obviously, I want to add it to the stream. Now I kind of need to wipe the "sp500" log and rebuild it, right? Or is there some mechanism to reset the starting offset to a particular offset (the one where I fixed the join)? Also, most likely I have other stream applications consuming from this topic; they would also need to do some kind of reset/replay. Would it be a better idea not to store the sp500 topic at all but make it part of one long stream process? But then I would end up doing the same join potentially several times.
4.) Maybe this should be question 1, since it is part of my goal ^^ - however, how could I model a data flow like this:
produce prices -> calculate returns -> join several returns into a row vector -> calculate covariance (window of row vectors) -> join covariance with returns -> into a tuple (ret, cov)
I am not even sure if such a complicated use case is possible using today's stream processing.
Kafka Streams is a client library for building applications and microservices where the input and output data are stored in an Apache Kafka® cluster. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology. By building on the Kafka producer and consumer libraries and leveraging Kafka's native capabilities for data parallelism, distributed coordination, fault tolerance, and operational simplicity, Kafka Streams simplifies application development: streams in Kafka are continuous, real-time flows of records (key-value pairs), and Kafka Streams provides real-time stream processing on top of the plain Kafka consumer and producer clients.
When using Kafka I think of the messages as key/value pairs, stored in a distributed, persisted and replicated topic and sent as an endless data stream. A topic can be configured with different retention times and retention/cleanup policies.
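To make the retention part concrete, here is a minimal sketch (not from the original answer) of creating such a topic with the Java AdminClient; the topic name "prices", the partition count and the broker address are just assumptions for illustration:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreatePricesTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical "prices" topic: 6 partitions, delete-based cleanup,
            // records kept for 7 days before they become eligible for deletion.
            NewTopic prices = new NewTopic("prices", 6, (short) 1)
                    .configs(Map.of(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                            TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(prices)).all().get();
        }
    }
}
```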
1) How you organize your topics is up to you. Basically you can do both, and depending on how you want to use the data later, both might make sense. In your use case I would write the prices to one topic. The key should be picked like a primary key in a relational DB: it guarantees the ordering of values sent per key and can also be used for retention (e.g. log compaction). BTW: you can consume from multiple streams/topics in one application.
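As a rough illustration of the single-topic approach, a producer could key every price record by its symbol; the topic name, symbols and serializers below are assumptions, not a prescribed layout:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.DoubleSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PriceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, DoubleSerializer.class);

        try (KafkaProducer<String, Double> producer = new KafkaProducer<>(props)) {
            // The symbol is the key, so all prices of one symbol land in the
            // same partition and stay ordered relative to each other.
            producer.send(new ProducerRecord<>("prices", "AAPL", 187.32));
            producer.send(new ProducerRecord<>("prices", "MSFT", 412.05));
        }
    }
}
```

Adding new symbols later just means producing with new keys; it does not disturb the per-key ordering of the symbols already in the topic.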
2) What you want to use here is the so-called "table/stream duality". (Side note: I think of streamed data as stateless and of a table as stateful.) Technically you construct a mapping (e.g. in memory) from a key to a value (the latest value for this key in the stream). Kafka Streams will do this for you with a KTable. Kafka itself can also do this for you using an additional topic with a compacted retention policy that keeps only the latest value per key.
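A minimal Kafka Streams sketch of that idea might look like the following; the topic names, serdes and the joined output format are assumptions, and the real value types would of course be richer than plain doubles:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class LatestPriceTable {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Changelog view of the topic: each new record for a key overwrites the
        // previous one, so the table always holds the latest price per symbol.
        KTable<String, Double> latestPrices =
                builder.table("prices", Consumed.with(Serdes.String(), Serdes.Double()));

        // A stream of per-symbol returns (hypothetical topic) can be enriched
        // with the latest known price via a stream-table join.
        KStream<String, Double> returns =
                builder.stream("returns", Consumed.with(Serdes.String(), Serdes.Double()));
        returns.join(latestPrices,
                        (ret, price) -> "return=" + ret + ",latestPrice=" + price)
               .to("returns-with-latest-price", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-price-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The point of the KTable here is exactly your "use the latest price if one is missing at this exact time" requirement: the join always looks up the most recent value per key instead of requiring a record at the same timestamp.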
3) The messages in a Kafka topic are stored according to your retention configuration, so you can configure it e.g. to keep all data for 7 days. If you want to add data later but have it treated with some other time than its produce time, you need to send a timestamp as part of your message data and use that one when processing it later. For every consumer you can set/reset the offset at which it should start reading, which means you can go back and reprocess all the data that is still in your topic.
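For a plain consumer, rewinding could look roughly like this sketch; the topic, partition, group id and target offset are placeholders (for a Kafka Streams application you would typically use the application reset tool instead):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.DoubleDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "sp500-rebuild");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, DoubleDeserializer.class);

        try (KafkaConsumer<String, Double> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("prices", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 0L); // rewind to whatever retained offset you need

            ConsumerRecords<String, Double> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r ->
                    System.out.printf("%s offset=%d price=%f%n", r.key(), r.offset(), r.value()));
        }
    }
}
```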
4) I am not sure what you are asking for here, because your flow seems fine for your goal. Kafka and stream processing are a good match for your use case.
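If it helps, here is a very rough sketch of how the stages could be wired as one Kafka Streams topology; the actual math (returns, covariance) is reduced to placeholders, and all topic names, window sizes and serdes are assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;

import java.time.Duration;
import java.util.Properties;

public class PortfolioTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // 1) prices, keyed by symbol
        KStream<String, Double> prices =
                builder.stream("prices", Consumed.with(Serdes.String(), Serdes.Double()));

        // 2) returns: placeholder transformation; a real implementation would
        //    keep the previous price per symbol in a state store.
        KStream<String, Double> returns = prices.mapValues(p -> Math.log(p));
        returns.to("returns", Produced.with(Serdes.String(), Serdes.Double()));

        // 3) collect the returns of all symbols into one windowed "row vector";
        //    here simply concatenated into a string as a stand-in.
        KTable<Windowed<String>, String> rowVectors = returns
                .groupBy((symbol, ret) -> "sp500", Grouped.with(Serdes.String(), Serdes.Double()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                .aggregate(() -> "",
                        (key, ret, agg) -> agg + ret + ";",
                        Materialized.with(Serdes.String(), Serdes.String()));

        // 4) downstream you would aggregate these vectors into a covariance
        //    estimate and join it back with the returns stream into (ret, cov).
        rowVectors.toStream((window, value) -> window.key())
                  .to("sp500-row-vectors", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "portfolio-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```

So the pipeline you describe maps onto chained streams, windowed aggregations and joins; whether you keep the intermediate topics ("returns", "sp500-row-vectors") or fold everything into one application is the trade-off you already identified in question 3.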
In general I can recommend reading the Confluent blog, the Confluent documentation and everything on the Kafka website. A lot of your questions depend on your requirements, your hardware and what you want to do in the software, so even with the given information I have to say "it depends". I hope this helps you and others starting with Kafka, even if it is just a quick attempt at explaining the concepts and giving some starting points.