
Is DynamoDB right for my 1M events-per-day scenario where I need access to both records and summary (aggregate) information

Software engineer from the C world breaking into the cloud space. I am trying to understand whether I can use DynamoDB for all my scenario needs, whether I shouldn't use it at all, or whether I should use DynamoDB for some scenarios and offload to another database (maybe a more traditional RDBMS) for the rest.

I have a bunch of machines which send telemetry to my AWS APIs. My current idea is to put these events in a DynamoDB table. In the first year I am expecting about 1M new entries total in the table per day (average > 10 per second), with peak times being closer to 100 events per second. The size of the telemetry events will be 0.5-1KB. First, I simply want to capture and log every telemetry event. That is about 1GB per day.

I would use an RDBMS, but I am concerned that, with the ACID properties, database access will slow down significantly as the database grows, given we have 10s to 100s of writes per second. Am I right to be concerned, or will a traditional RDBMS suit my needs? I don't feel I need the atomic guarantees of an RDBMS (I can live with eventual consistency), especially for reading/aggregating queries.

I was thinking my primary key would be a composite key of machine ID (partition key) and Unix millisecond epoch time (sort key). This should support a key scenario such as retrieving a list of all telemetry events for a specific time range and for a set of machines.
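To make that concrete, here is a rough sketch of the write path I have in mind (Python/boto3; the table and attribute names are just placeholders, not a final design):

```python
# Hypothetical item shape and write path for the composite key described above.
import time
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
# Assumed table: partition key "machine_id" (string), sort key "ts" (number).
table = dynamodb.Table("telemetry")

def record_event(machine_id: str, event_type: str, value: float) -> None:
    """Write one telemetry event keyed by machine ID and millisecond epoch."""
    table.put_item(
        Item={
            "machine_id": machine_id,          # partition key
            "ts": int(time.time() * 1000),     # sort key: Unix epoch in milliseconds
            "event_type": event_type,          # e.g. "temperature", "triggered"
            "value": Decimal(str(value)),      # boto3 requires Decimal, not float
        }
    )

record_event("machine-0042", "temperature", 71.5)
```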

However, there are other things I wish to do, such as getting daily totals for various telemetry events, like the number of times a machine sensor was triggered (i.e. a count), or listing all the telemetry events recorded when the temperature was above a certain value for a set of machines. Telemetry events are heterogeneous - there are many types, such as temperature, triggered, and so on - and I need to be able to quickly filter for a specific type of telemetry event for a given set of machineIDs and time window.

So you can imagine one of my queries might look like 'for this time window, and this set of machines, list all temperature (type) telemetry events where a temperature over 70 degrees was recorded'. The first two constraints come from my partition key (machineId) and sort key (time), but the latter two would constrain on the telemetry type being temperature and the value being greater than 70. That's 4 fields I'm constraining on. Should I accept the reality that I can query on machineId and time, but will have to scan for the telemetry type and value from the results returned by that query?
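For illustration, that query would look roughly like this with boto3 (again, names are placeholders). The key condition covers the machine and time range; the type and value constraints would go into a filter expression, which DynamoDB applies after reading the items, so I'd still pay read capacity for everything in the time range:

```python
# Hypothetical query: key condition on machine + time, filter on type and value.
from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("telemetry")  # same assumed table as above

def temperature_over(machine_id: str, start_ms: int, end_ms: int, threshold: float):
    """Run once per machine in the set; each call stays within one partition."""
    response = table.query(
        KeyConditionExpression=Key("machine_id").eq(machine_id)
        & Key("ts").between(start_ms, end_ms),
        FilterExpression=Attr("event_type").eq("temperature")
        & Attr("value").gt(Decimal(str(threshold))),
    )
    return response["Items"]  # pagination via LastEvaluatedKey omitted for brevity
```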

Further, this DynamoDB table ingests telemetry, but the frontend needs to present summary information such as the last time a telemetry event was received, the last time a machine was serviced, and daily counts of specific telemetry events. I don't want to re-query these every time the summary page reloads in the frontend. Should these aggregates be calculated when the telemetry arrives and then stored in a separate table (or, if performance requires it, in memory in something like Redis)? Is this technically stream processing?

I read online that good design typically only uses one table (single table design), but I am finding it hard to understand how to get there. And maybe it doesn't make sense in my scenario where I want to be able to view individual telemetry events but also have quick access to aggregate information.

After writing out this question, here is what my intuition is telling me right now:

  1. Do use DynamoDB
  2. As you ingest telemetry events create aggregates for most common user scenarios (hourly, daily, weekly, monthly totals) and store them somewhere for quick access. Investigate stream processing and/or kinesis here.
  3. For less common user scenarios requiring aggregates, use another technology like Redshift or Athena

Thanks, Thomas

Asked Jan 05 '21 at 20:01 by friartuck



1 Answer

My current idea is to put these events in a DynamoDB table. In the first year I am expecting about 1M new entries total in the table per day (average > 10 per second), with peak times being closer to 100 events per second. The size of the telemetry events will be 0.5-1KB.

Sounds like a very good use case for DynamoDB — lots of relatively small writes per second.

I would use an RDBMS, but I am concerned that, with the ACID properties, database access will slow down significantly as the database grows, given we have 10s to 100s of writes per second. Am I right to be concerned, or will a traditional RDBMS suit my needs?

An RDBMS can scale to handle that kind of load, yes. Not trivial, but nothing out of this world, either.

Since you mention DynamoDB, I assume you're ok with other services in AWS — I'd suggest looking at Aurora for this kind of thing, if you decide to use an RDBMS.

But it sounds like a waste, especially because of this:

I don't feel I need the atomic guarantees of an RDBMS (I can live with eventual consistency), especially for reading/aggregating queries.

So this points away from an RDBMS. Not necessarily toward DynamoDB, though.

You'll need to be careful with the data model, though.

Now...

Should I accept the reality that [...] will have to scan for the telemetry type and value from the results returned by that query?

Almost certainly not. Designing DynamoDB-based data models that, from the start, require you to Scan a growing amount of data is a bad, bad idea. You'll hate yourself in the future for it if you go that way.

The very important thing to keep in mind here is that you don't want operations to take longer as the data grows (which is exactly your concern with an RDBMS).

A Scan operation will take longer the larger the table is. So your system's performance will degrade over time as you accumulate more and more data in your table. You can possibly work around this by offloading "non-hot" data from a "main" table to an "archival" table, for example. But this increases the complexity of managing the whole system.

Instead, try to design your data model so that operations maintain their performance regardless of the total volume of data you have. E.g., make sure that your queries are always restricted to one item collection (i.e., items sharing the same partition key), or at most an O(1) number of them.
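For instance, here's one way to get that property, purely as an illustrative sketch (key and attribute names are made up, and it assumes most queries are bounded time windows per machine): put a time bucket such as the day into the partition key, so an item collection never holds more than one machine-day of events, and a query only ever touches the handful of day buckets its window spans, no matter how large the table becomes.

```python
# Illustrative key design: partition key = "MACHINE#<id>#DAY#<yyyy-mm-dd>", sort key = ts.
# Each item collection stays bounded to one machine-day regardless of total table size.
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("telemetry")  # assumed table name

def query_window(machine_id: str, start_ms: int, end_ms: int):
    """Visit the day buckets the window spans; each Query hits one small item collection."""
    items = []
    day = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc).date()
    end_day = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc).date()
    while day <= end_day:
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"MACHINE#{machine_id}#DAY#{day.isoformat()}")
            & Key("ts").between(start_ms, end_ms)
        )
        items.extend(resp["Items"])  # pagination omitted for brevity
        day += timedelta(days=1)
    return items
```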

the frontend needs to present summary information [...]. I don't want to have to re-query them every time [...]. Should these aggregates be calculated when the telemetry arrives and then stored in a separate table (or perhaps if performance needs it, in memory something like redis). Is this technically stream processing?

Stream processing is a good approach here.

You are absolutely right in avoiding recalculating this summary information every time the frontend is displayed. Aggregating it "on the fly" and storing it somewhere is a very good approach. The frontend would then simply have to "show the most recent pre-calculated piece of data": trivial.

Now, the question becomes when to aggregate, where to store, and what to do if this fails.

  • When to aggregate.

You could do it as the telemetry arrives, yes, but if you're going to be writing it to DynamoDB, keep in mind that you can use DynamoDB Streams — every new item gets written into a durable stream that can be replayed within its 24-hour retention window (in case you need to recalculate something recent), and you can have it automatically invoke a Lambda function, for example, to perform your aggregation.
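As a minimal sketch of that idea (assuming the table's stream is configured with the NEW_IMAGE view type and wired to a Lambda function; the aggregate table, key names and attribute names here are made up):

```python
# Lambda handler fed by a DynamoDB Stream: fold new telemetry items into daily counters.
import time

import boto3

aggregates = boto3.resource("dynamodb").Table("telemetry_daily_aggregates")  # assumed table

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]       # attributes arrive in DynamoDB JSON form
        machine_id = image["machine_id"]["S"]
        event_type = image["event_type"]["S"]
        day = time.strftime("%Y-%m-%d", time.gmtime(int(image["ts"]["N"]) / 1000))
        aggregates.update_item(
            Key={"pk": f"MACHINE#{machine_id}#DAY#{day}", "sk": f"COUNT#{event_type}"},
            UpdateExpression="ADD #c :one",            # atomic counter increment
            ExpressionAttributeNames={"#c": "count"},  # "count" is a reserved word
            ExpressionAttributeValues={":one": 1},
        )
```

One caveat: Lambda retries a failed batch, and ADD is atomic but not idempotent, so a retry can double-count; if exact counts matter, you'd want to make the update idempotent (e.g., by tracking which event keys have already been folded in).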

  • Where to store.

The risk of storing in memory is that you increase the chance of losing your summary, and having to recalculate it. I'd prefer at least some durability here. The best solution would depend on a lot of factors, including how you're serving this frontend (static html? a non-cacheable API request?), and how often you want it to be updated (every X seconds? every new piece of telemetry data?).

One thing that you may want to investigate here is Kinesis Data Analytics: it can run all the aggregation automatically for you, and write the results to some durable storage that you choose. You can then decide how that summary gets into the front end (e.g., request from it every time, write it into something like Redis, invert control and push it to servers serving the frontend so they don't have to make requests, etc).

  • What to do if this fails.

This part is very important. You don't want your frontend to break if the summary is either outdated due to a failure in the processing pipeline, or if the data is missing in case of loss of the in-memory cache servers or unavailability of the durable storage. So it's important to make sure you have a clear idea of how you'll handle these situations.

The other area you should think about here is what to do if you lose your current aggregation. The DynamoDB table contains all the measurements, so you should be able to recalculate all the summaries. But remember — the data is growing, and a full recalculation will require a scan, so it may take a very long time to do from scratch.

You may want to write "checkpoints" (e.g., "all summary data up to day X") or coarser pre-aggregations (e.g., "all summary data only for day X"), assuming that the statistics you are calculating can be correctly derived from pre-aggregations (e.g., you can do that with an "average" or a "max"; you cannot do it with a p99). Saving those checkpoints to S3 can be a good idea. Alternatively, using multiple DynamoDB tables as I mentioned above can be useful: a "hot" table which is being added to, and a "non-hot" table to which you move archived data and whose summary you always keep somewhere; that way, you would only ever have to scan the small portion of "hot" data.
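To make the "derivable from pre-aggregations" caveat concrete, here is a tiny illustrative sketch (not tied to any particular storage): count, sum (and therefore average) and max compose across daily checkpoints, whereas a percentile such as p99 does not and would have to be recomputed from the raw events.

```python
# Rolling daily pre-aggregations up into a coarser (e.g. monthly) summary.
from dataclasses import dataclass

@dataclass
class DailySummary:
    count: int
    total: float    # sum of values, so the average can be derived later
    maximum: float

def roll_up(days: list["DailySummary"]) -> "DailySummary":
    return DailySummary(
        count=sum(d.count for d in days),
        total=sum(d.total for d in days),
        maximum=max(d.maximum for d in days),
    )

month = roll_up([DailySummary(1000, 68000.0, 74.2), DailySummary(900, 60300.0, 71.9)])
average = month.total / month.count   # fine; a p99 cannot be recovered this way
```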

I read online that good design typically only uses one table (single table design), but I am finding it hard to understand how to get there.

Yeah... be careful with that. While I don't think it's inherently a bad design, the fact is that a lot of people are absurdly over-hyping it. The main issue is that it makes the design considerably harder to evolve.

With that said, keep in mind that this is not an RDBMS. I say this because I got the feeling that that's something you're more used to. But designing a data model in DynamoDB is very different (single-table or not). Duplication of data is a normal part of it, and "normal forms" are not desirable.

The main thing to always keep in mind is that you want your read and write operations both to be O(1) with the volume of data, and at the same time you want the flexibility to query all the information that you need. Single-table isn't something that automatically gives you that, nor does it necessarily make achieving it simpler.


After writing out this question, here is what my intuition is telling me right now: [...]

I'd say you have very good intuition. Spot on.

Regarding point number 3, Athena and Redshift, keep in mind that they're usually better suited for "batch processing" (although not exclusively). In other words, you don't want your frontend, for example, ever sending a query to either and waiting for the results. You may want a process to query either of them and store the pre-calculated result on, e.g., S3, which will then be displayed on something like your frontend. But they should be kept separate from the low-latency requests.
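For example (purely illustrative; this assumes the telemetry has been exported to S3 as a data lake that Athena can query, and the bucket, database and table names are placeholders), a scheduled job could kick off the aggregation and let Athena drop the result in S3, keeping the slow query entirely off the user-facing path:

```python
# Kick off a batch aggregation in Athena; results land in S3, not in a user request.
import boto3

athena = boto3.client("athena")

def run_batch_aggregate() -> str:
    resp = athena.start_query_execution(
        QueryString=(
            "SELECT machine_id, date_trunc('day', from_unixtime(ts / 1000)) AS day, "
            "count(*) AS events "
            "FROM telemetry GROUP BY 1, 2"
        ),
        QueryExecutionContext={"Database": "telemetry_lake"},               # placeholder
        ResultConfiguration={"OutputLocation": "s3://example-results-bucket/summaries/"},
    )
    return resp["QueryExecutionId"]  # poll get_query_execution() to know when it finishes
```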

Answered Oct 19 '22 at 04:10 by Bruno Reis