
Kafka vs. MongoDB for time series data

I'm contemplating whether to use MongoDB or Kafka for a time series dataset.

At first sight it obviously makes sense to use Kafka, since that's what it's built for. But I would also like some flexibility in querying, etc.

Which brought me to the question: "Why not just use MongoDB to store the timestamped data and index it by timestamp?"

Naively, this feels like it has a similar benefit to Kafka (in that it's indexed by time offset) but with more flexibility. Then again, I'm sure there are plenty of reasons why people use Kafka instead of MongoDB for this type of use case.

Could someone explain some of the reasons why one may want to use Kafka instead of MongoDB in this case?

Vlad asked Sep 04 '18 15:09


2 Answers

I'll take this question as meaning that you're trying to collect metrics over time.

Yes, Kafka topics have configurable time-based retention, and I doubt you're using topic compaction, because your messages would likely be in the form of (time, value), so the time could not be repeated anyway.
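For reference, time-based retention is a per-topic setting (`retention.ms`); a sketch of setting it with the stock CLI, where the topic name `metrics` and the broker address are placeholders:

```shell
# Keep messages on the (hypothetical) "metrics" topic for 7 days
# (604800000 ms), after which the broker may delete old segments.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name metrics \
  --add-config retention.ms=604800000
```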

Kafka also provides stream processing libraries, so you can compute averages, min/max, outliers & anomalies, top-K, etc. over windows of time.
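To illustrate what such a windowed aggregation computes, here is a plain-Python sketch over (timestamp, value) pairs; this is the idea behind tumbling windows, not the Kafka Streams API itself:

```python
from collections import defaultdict

def tumbling_window_stats(events, window_ms):
    """Group (timestamp_ms, value) events into fixed, non-overlapping
    windows of window_ms and compute min/max/avg per window."""
    windows = defaultdict(list)
    for ts, value in events:
        # Align each timestamp to the start of its window.
        windows[ts - ts % window_ms].append(value)
    return {
        start: {"min": min(vals), "max": max(vals), "avg": sum(vals) / len(vals)}
        for start, vals in windows.items()
    }

events = [(1000, 2.0), (1500, 4.0), (2500, 10.0)]
stats = tumbling_window_stats(events, window_ms=1000)
# window starting at 1000 holds 2.0 and 4.0; window at 2000 holds 10.0
```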

However, while processing all that data is great and useful, your consumers would be stuck doing linear scans of this data, not easily able to query slices of it for any given time range. And that's where time indexes (not just a start index, but also an end) would help.
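The difference is essentially sorted-index lookup versus scanning everything. A minimal sketch of what a time index buys you (a toy in-memory index, standing in for what a real database does with a B-tree on the timestamp column):

```python
import bisect

class TimeIndex:
    """Toy time index: keeps records sorted by timestamp so a
    [start, end] range query is two binary searches plus a slice,
    instead of a linear scan over every record."""

    def __init__(self, records):
        # records: iterable of (timestamp, value), in any order
        self._records = sorted(records)
        self._times = [t for t, _ in self._records]

    def range(self, start, end):
        """Return all records with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return self._records[lo:hi]

idx = TimeIndex([(3, "c"), (1, "a"), (2, "b"), (5, "e")])
# idx.range(2, 3) touches only the matching slice, not the whole log
```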

So, sure, you can use Kafka to create a backlog of queued metrics and process/filter them over time, but I would suggest consuming that data into a proper database, because I assume you'll want to query it more easily and potentially create some visualizations over that data.

With that architecture, you could have your highly available Kafka cluster holding onto data for some amount of time, while your downstream systems don't necessarily have to be online all the time in order to receive events. Once they come back, they'd consume from the last committed offset and pick up where they left off.
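The resume-from-offset behavior can be sketched with a toy model (plain Python, not the Kafka client API; it assumes retention still covers the offline period):

```python
class OfflineTolerantConsumer:
    """Toy model of offset-based resumption: the consumer remembers
    the offset of the next unread record, so going offline loses
    nothing as long as the log still retains those records."""

    def __init__(self):
        self.committed = 0  # offset of the next record to read

    def poll(self, log):
        """Return every record appended since the last poll, then
        commit the new position."""
        batch = log[self.committed:]
        self.committed = len(log)
        return batch

log = []
consumer = OfflineTolerantConsumer()
log.extend(["m1", "m2"])
first = consumer.poll(log)      # reads m1, m2
# consumer goes offline; producers keep appending
log.extend(["m3", "m4"])
second = consumer.poll(log)     # back online: resumes at the committed offset
```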

OneCricketeer answered Oct 24 '22 19:10


Like the answers in the comments above: neither Kafka nor MongoDB is well suited as a time-series DB with flexible query capabilities, for the reasons that @Alex Blex explained well.

Depending on your requirements for processing speed vs. query flexibility vs. data size, I would choose among the following:

  1. Cassandra [best processing speed, best/good data size limits, worst query flexibility]
  2. TimescaleDB on top of PostgresDB [good processing speed, good/OK data size limits, good query flexibility]
  3. ElasticSearch [good processing speed, worst data size limits, best query flexibility + visualization]

P.S. By "processing" here I mean ingestion, partitioning, and roll-ups where needed.

P.P.S. I picked the options that are, in my opinion, most widely used now, but there are dozens and dozens of other options and combinations, and many more selection criteria to apply. I would be interested to hear about other engineers' experiences!

Marina answered Oct 24 '22 20:10