I'm contemplating on whether to use MongoDB or Kafka for a time series dataset.
At first sight obviously it makes sense to use Kafka since that's what it's built for. But I would also like some flexibility in querying, etc.
Which brought me to question: "Why not just use MongoDB to store the timestamped data and index them by timestamp?"
Naively thinking, this feels like it has the similar benefit of Kafka (in that it's indexed by time offset) but has more flexibility. But then again, I'm sure there are plenty of reasons why people use Kafka instead of MongoDB for this type of use case.
Could someone explain some of the reasons why one may want to use Kafka instead of MongoDB in this case?
I'll try to take this question as that you're trying to collect metrics over time
Yes, Kafka topics have configurable time retentions, and I doubt you're using topic compaction because your messages would likely be in the form of (time, value)
, so the time could not be repeated anyway.
Kafka also provides stream processing libraries so that you can find out averages, min/max, outliers&anamolies, top K, etc. values over windows of time.
However, while processing all that data is great and useful, your consumers would be stuck doing linear scans of this data, not easily able to query slices of it for any given time range. And that's where time indexes (not just a start index, but also an end) would help.
So, sure you can use Kafka to create a backlog of queued metrics and process/filter them over time, but I would suggest consuming that data into a proper database because I assume you'll want to be able to query it easier and potentially create some visualizations over that data.
With that architecture, you could have your highly available Kafka cluster holding onto data for some amount of time, while your downstream systems don't necessarily have to be online all the time in order to receive events. But once they are, they'd consume from the last available offset and pickup where they were before
Like the answers in the comments above - neither Kafka nor MongoDB are well suited as a time-series DB with flexible query capabilities, for the reasons that @Alex Blex explained well.
Depending on the requirements for processing speed vs. query flexibility vs. data size, I would do the following choices:
P.S. by "processing" here I mean both ingestion, partitioning and roll-ups where needed P.P.S. I picked those options that are most widely used now, in my opinion, but there are dozens and dozens of other options and combinations, and many more selection criteria to use - would be interested to hear about other engineers' experiences!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With