I would like to store 1M+ different time series in Amazon's DynamoDB database. Each time series will have about 50K data points. A data point consists of a timestamp and a value.
The application will add new data points to the time series continuously and will occasionally retrieve a time series (usually in full) for analytics.
How should I structure the database? Should I create a separate table for each time series? Or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it's written for petabyte-sized reporting solutions.
In DynamoDB, I can think of a few viable designs. In the first, you could use one table with a compound hash/range key (both strings). The hash key would be the time series name, the range key would be the timestamp as an ISO8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering, provided every timestamp uses the same format and time zone), and each item would carry one extra attribute, a 'value'. This gives you the ability to select everything from a time series (Query on hashKey equality) or a subset of a time series (Query on hashKey equality plus a BETWEEN clause on rangeKey).
However, your main problem is the "hotspot" problem: internally, DynamoDB partitions your data by hashKey and spreads your provisioned read capacity across all of your partitions. So you may have provisioned 1000 KB of reads a second, but with 100 partitions each partition gets only 10 KB a second, and reading all the data from a single time series (a single hashKey) hits only one partition. You might expect 1000 KB of reads to give you about 1 MB a second, but reading a 10 MB time series could take far longer than 10 seconds, because the single partition serving it will throttle you much more heavily.
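As a rough illustration of the single-table key design, here is a minimal Python sketch that builds the item and the request parameters without calling AWS. The table name `timeseries` and the attribute names `series`, `ts` and `value` are made up for this example; the dictionaries follow the low-level request shape (`{"S": ...}` / `{"N": ...}` typed values, `KeyConditionExpression` with placeholders) as I understand it, so treat it as a sketch rather than a finished client.

```python
from datetime import datetime, timezone

TABLE_NAME = "timeseries"  # hypothetical table name


def iso_utc(dt: datetime) -> str:
    """Format an aware datetime as a fixed-width UTC ISO 8601 string.

    Fixed width plus a single time zone is what makes string order
    match chronological order.
    """
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def make_item(series: str, dt: datetime, value: float) -> dict:
    """One data point: hash key = series name, range key = timestamp."""
    return {
        "series": {"S": series},       # hash key (assumed attribute name)
        "ts": {"S": iso_utc(dt)},      # range key (assumed attribute name)
        "value": {"N": str(value)},    # numbers travel as strings
    }


def range_query_params(series: str, start: datetime, end: datetime) -> dict:
    """Parameters for 'hash key equals X and range key BETWEEN lo AND hi'."""
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": "#s = :s AND #t BETWEEN :lo AND :hi",
        # Placeholders avoid clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#s": "series", "#t": "ts"},
        "ExpressionAttributeValues": {
            ":s": {"S": series},
            ":lo": {"S": iso_utc(start)},
            ":hi": {"S": iso_utc(end)},
        },
    }


if __name__ == "__main__":
    stamps = [
        datetime(2013, 11, 5, 9, 30, tzinfo=timezone.utc),
        datetime(2013, 2, 28, 23, 59, tzinfo=timezone.utc),
        datetime(2012, 12, 31, 0, 0, tzinfo=timezone.utc),
    ]
    # Sorting the strings gives the same order as sorting the datetimes.
    as_text = sorted(iso_utc(s) for s in stamps)
    assert as_text == [iso_utc(s) for s in sorted(stamps)]
    print(make_item("sensor-1", stamps[0], 21.5))
    print(range_query_params("sensor-1", stamps[2], stamps[0]))
```

With boto3, dictionaries like these would typically be passed to the client's `put_item` and `query` calls. Reading a whole 50K-point series means following the pagination key across several Query calls, which is where the per-partition throttling discussed above would show up.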
On the upside, DynamoDB has an extremely high but costly upper bound on scaling; if you wanted, you could pay for 100,000 read capacity units and have sub-second response times on all of that data.
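To put rough numbers on the trade-off between partition throttling and simply buying more capacity, the back-of-the-envelope arithmetic from the single-table design can be written out in Python. This uses a deliberately simplified model, assuming throughput is split evenly across partitions, 1 MB = 1000 KB, and no bursting or change in partition count; the function names are invented for this sketch and real behaviour may differ.

```python
def seconds_to_read(stored_kb: float, provisioned_kb_per_s: float,
                    partitions: int) -> float:
    """Time to read one series that lives entirely on one partition."""
    per_partition_kb_per_s = provisioned_kb_per_s / partitions
    return stored_kb / per_partition_kb_per_s


def provisioned_needed(stored_kb: float, target_seconds: float,
                       partitions: int) -> float:
    """Total provisioned KB/s so ONE partition can meet the target time."""
    return stored_kb / target_seconds * partitions


SERIES_KB = 10 * 1000      # the 10 MB series from the example
PROVISIONED = 1000         # KB of reads per second, table-wide
PARTITIONS = 100

throttled = seconds_to_read(SERIES_KB, PROVISIONED, PARTITIONS)
ideal = seconds_to_read(SERIES_KB, PROVISIONED, 1)
needed = provisioned_needed(SERIES_KB, 1.0, PARTITIONS)

print(f"one hot partition: {throttled:.0f} s")   # 1000 s
print(f"whole table's capacity: {ideal:.0f} s")  # 10 s
print(f"table-wide KB/s for a 1 s read: {needed:.0f}")  # 1000000
```

Under these assumptions, a read you might expect to finish in 10 seconds takes 1000, and getting it under a second by provisioning alone would mean paying for capacity that mostly sits idle on the other partitions.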
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try to spread your time series across 10 tables, where "highly read" data goes in table 1, "almost never read" data goes in table 10, and all other data goes somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at the cost of a lot of complexity in your design. Overall, it's probably not worth it: where do you put new time series? How do you remember where they all are? How do you move a time series?
From my own experience, I think DynamoDB supports some internal "bursting" on these kinds of reads, so it's possible my numbers are off and you will get adequate performance. However, my verdict is to look into Redshift.