I need to keep local data on an iOS app in sync with data in a DynamoDB table. The DynamoDB table is ~2K rows, with only a hash key (id
), and the following attributes:
id
(uuid)lastModifiedAt
(timestamp)name
latitude
longitude
I am currently scanning and filtering by lastModifiedAt
, where lastModifiedAt
is greater than the app's last refresh date, but I imagine that will become expensive.
The best answer I can find is to add a Global Secondary Index with lastModifiedAt
as the range, but there is no obvious hash key for the GSI.
What is best practice when needing to query by range using a GSI, but there is no obvious hash key? Alternatively, if a full scan is the only option, are there any best practices to keep down the cost?
Hash key in DynamoDB The primary reason for that complexity is that you cannot query DynamoDB without the hash key. So, it's not allowed to query the entire database. That means you cannot do what you would call a full table scan in other databases.
Can the DynamoDB sort key be null? DynamoDB does not support null for sort key.
A primary key is consists of a hash key and an optional range key. Hash key is used to select the DynamoDB partition. Partitions are parts of the table data.
There is no way to query by multiple hash keys, but, as of April 2014, you can use QueryFilter so you can filter by non key fields in addition to hash key fields.
Although a Global Secondary Index
seems to fit your requirements, any attempt to include timestamp
related information as part of your Hash Key
will most likely create what is known as "Hot Partition", which is extremely undesirable.
The uneven access will occur as the most recent items are going to be retrieved with way more frequency than the old ones. This will not only impact your performance but also make your solution less cost effective.
See some details from the documentation:
For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
Based on what is stated, the id
seems indeed to be a good choice for your Hash Key
(aka. Partition Key
), I wouldn't change that as GSI keys work in the same way as far as partitioning. As a separate note, performance is highly optimized when you retrieve your data by providing the entire Primary Key
, so we should try to find a solution that provides that whenever possible.
I would suggest creating separate tables to store the primary keys based on how recent they were updated. You can segment the data into tables based on the granularity that best fits your use cases. For example, say that you want to segment the updates by day:
a. Your daily updates could be stored in tables with the following naming convention: updates_DDMM
b. The updates_DDMM
tables would only have the id
's (hash keys of the other table)
Now say that the latest app refresh date was from 2 days ago (04/07/16) and you need to get the recent records, you would then need:
i. Scan the tables updates_0504
and updates_0604
to get all the hash keys.
ii. Finally obtain the records from the main table (containing lat/lng, name, etc) by submitting a BatchGetItem
with all the obtained hash keys.
BatchGetItem
is super fast and will do the job like no other operation.
One can argue that creating additional tables will add cost to your overall solution... well, with GSI
you are essentially duplicating your table (in case you are projecting all fields) and adding that additional cost for all ~2k records, being them recently updated or not...
It seems counter intuitive creating tables like this but it is actually a best practice when dealing with time series data (From AWS DynamoDB Documentation):
[...] the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
I hope that helps. Regards.
While D.Shawley's answer helped point me in the right direction, it missed two considerations for a GSI:
As such, here is the approach I took:
YearMonth
(e.g., 201508
) and range as id
lastModifiedAt > [given timestamp]
.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With