I am using DynamoDB tables with keys and throughput optimized for the application's use cases. To support other ad hoc administrative and reporting use cases, I want to keep a complete backup in S3 (a day-old backup is OK). However, I cannot afford to scan the entire tables to produce that backup, and the keys I have are not sufficient to determine what is "new". How do I do incremental backups? Do I have to modify my DynamoDB schema, or add extra tables, just to do this? Any best practices?
Update: DynamoDB Streams solves this problem.
DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near real time.
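For example, here is a minimal sketch that drains whatever is still in the 24-hour stream window and archives it to S3 using boto3. The table and bucket names are placeholders, and a production setup would more commonly attach a Lambda trigger to the stream instead of polling it like this:

```python
import json

import boto3

# Placeholder names for illustration only.
TABLE_NAME = "my-table"
BUCKET = "my-backup-bucket"

dynamodb = boto3.client("dynamodb")
streams = boto3.client("dynamodbstreams")
s3 = boto3.client("s3")

# The table must have a stream enabled (e.g. NEW_AND_OLD_IMAGES).
stream_arn = dynamodb.describe_table(TableName=TABLE_NAME)["Table"]["LatestStreamArn"]

changes = []
shards = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"]
for shard in shards:
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # everything still in the 24-hour window
    )["ShardIterator"]
    while iterator:
        page = streams.get_records(ShardIterator=iterator)
        changes.extend(page["Records"])
        iterator = page.get("NextShardIterator")
        if not page["Records"]:
            break  # caught up on this shard

# Persist the captured modifications as one incremental-backup object in S3.
s3.put_object(
    Bucket=BUCKET,
    Key="dynamodb-incremental/changes.json",
    Body=json.dumps(changes, default=str),
)
```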
Amazon DynamoDB offers two types of backup: point-in-time recovery (PITR) and on-demand backups. PITR provides continuous backups of your table and lets you restore your data to any point in time in the preceding 35 days; on-demand backups are full snapshots you create explicitly. Both can be driven from the API, as sketched after the pricing notes below.
All of your data is stored on solid-state drives (SSDs) and is automatically replicated across multiple Availability Zones within an AWS Region, providing built-in high availability and durability. You can use global tables to keep DynamoDB tables in sync across AWS Regions.
Keep in mind that, unless you opt for on-demand capacity mode, every DynamoDB access pattern requires its own allocation of read capacity units and write capacity units, so a backup scan competes with the throughput you have provisioned for the application.
Backup pricing:

- Point-in-time recovery: $0.20 per GB-month
- On-demand (snapshot) backup: $0.10 per GB-month
- Restoring a backup: $0.15 per GB
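Both backup types, plus the related export-to-S3 feature (which requires PITR and does not consume read capacity), can be driven from the API. A minimal boto3 sketch, with the table and bucket names as placeholders:

```python
import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "my-table"  # placeholder

# Turn on continuous backups (PITR) for the table.
dynamodb.update_continuous_backups(
    TableName=TABLE_NAME,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Take an on-demand (snapshot) backup.
dynamodb.create_backup(TableName=TABLE_NAME, BackupName=f"{TABLE_NAME}-daily")

# With PITR enabled, the table can also be exported straight to S3 without
# consuming any read capacity (bucket name is a placeholder).
table_arn = dynamodb.describe_table(TableName=TABLE_NAME)["Table"]["TableArn"]
dynamodb.export_table_to_point_in_time(
    TableArn=table_arn,
    S3Bucket="my-backup-bucket",
    ExportFormat="DYNAMODB_JSON",
)
```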
I see two options:
Option 1: Generate the current snapshot. You'll have to read the table to do this, which you can do at a very slow rate to stay under your capacity limits (a throttled Scan operation). Then keep an in-memory list of the updates performed over some period of time. You could put these in another table, but you'd have to read those back too, which would probably cost just as much. The interval could be a minute, 10 minutes, an hour: whatever you're comfortable losing if your application exits. Then periodically grab your snapshot from S3, replay these changes onto it, and upload the new snapshot. I don't know how large your data set is, so this may not be practical, but I've seen it done with great success for data sets up to 1-2 GB.
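A rough sketch of the replay step, assuming a hypothetical S3 bucket, a snapshot stored as a JSON object keyed by primary key, and a change-log entry shape invented purely for illustration:

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"                   # placeholder bucket
SNAPSHOT_KEY = "backups/table-snapshot.json"  # snapshot stored as {primary_key: item}

def apply_changes_and_upload(change_log):
    """Replay the buffered changes onto the last snapshot and upload the result.

    Each change_log entry uses an invented shape for illustration:
    {"action": "put" | "delete", "key": "<primary key>", "item": {...}}
    """
    body = s3.get_object(Bucket=BUCKET, Key=SNAPSHOT_KEY)["Body"].read()
    snapshot = json.loads(body)

    for change in change_log:
        if change["action"] == "put":
            snapshot[change["key"]] = change["item"]
        elif change["action"] == "delete":
            snapshot.pop(change["key"], None)

    s3.put_object(Bucket=BUCKET, Key=SNAPSHOT_KEY, Body=json.dumps(snapshot))
```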
Option 2: Add read throughput and back up your data with a full scan every day. You say you can't afford it, but it isn't clear whether you mean paying for the capacity or that the scan would use up all the capacity and cause the application to start failing. The only way to pull data out of DynamoDB is to read it, either strongly or eventually consistently. If the backup is part of your business requirements, then you have to decide whether it's worth the cost. You can self-throttle your reads by examining the ConsumedCapacityUnits property on your results (surfaced as ConsumedCapacity in current SDK versions). The Scan operation also has a Limit parameter you can use to cap how much is read in each request, and scans use eventually consistent reads by default, which cost half as much as strongly consistent reads.
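A minimal sketch of such a self-throttled scan with boto3, assuming a hypothetical table name and read-capacity budget:

```python
import time

import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "my-table"   # placeholder
RCU_PER_SECOND = 10       # how much read capacity the backup is allowed to consume

def slow_scan():
    """Scan the whole table, sleeping so consumption stays near RCU_PER_SECOND."""
    kwargs = {
        "TableName": TABLE_NAME,
        "Limit": 100,                       # cap items evaluated per request
        "ConsistentRead": False,            # eventually consistent: half the RCU cost
        "ReturnConsumedCapacity": "TOTAL",  # ask DynamoDB to report capacity used
    }
    while True:
        page = dynamodb.scan(**kwargs)
        yield from page["Items"]

        # Self-throttle: sleep long enough that the capacity this page consumed
        # averages out to the budget above.
        consumed = page["ConsumedCapacity"]["CapacityUnits"]
        time.sleep(consumed / RCU_PER_SECOND)

        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

for item in slow_scan():
    pass  # e.g. append each item to a local file that is uploaded to S3 at the end
```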