How to query DynamoDB by date (range key), with no obvious hash key?

I need to keep local data on an iOS app in sync with data in a DynamoDB table. The DynamoDB table is ~2K rows, with only a hash key (id), and the following attributes:

  • id (uuid)
  • lastModifiedAt (timestamp)
  • name
  • latitude
  • longitude

I am currently scanning and filtering by lastModifiedAt, where lastModifiedAt is greater than the app's last refresh date, but I imagine that will become expensive.
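For illustration, here is roughly what that Scan-and-filter refresh looks like in Python/boto3 (the question concerns an iOS app, but boto3 is used here as a neutral sketch language, and the table name "places" is hypothetical). A Scan consumes read capacity for every item it examines, and the FilterExpression is applied only after the read, which is why this approach gets expensive:

```python
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("places")  # hypothetical table name

def changed_since(last_refresh_ts):
    """Return all items with lastModifiedAt greater than the given timestamp."""
    items = []
    kwargs = {"FilterExpression": Attr("lastModifiedAt").gt(last_refresh_ts)}
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:  # no more pages
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```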

The best answer I can find is to add a Global Secondary Index with lastModifiedAt as the range key, but there is no obvious hash key for the GSI.

What is best practice when needing to query by range using a GSI, but there is no obvious hash key? Alternatively, if a full scan is the only option, are there any best practices to keep down the cost?

Asked Mar 12 '16 by James Skidmore



2 Answers

Although a Global Secondary Index seems to fit your requirements, any attempt to include timestamp-related information as part of your Hash Key will most likely create what is known as a "hot partition", which is extremely undesirable.

The uneven access occurs because the most recent items are retrieved far more frequently than older ones. This not only hurts performance but also makes your solution less cost-effective.

See some details from the documentation:

For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.

Based on what is stated, id does indeed seem to be a good choice for your Hash Key (a.k.a. Partition Key), and I wouldn't change that, since GSI keys are partitioned the same way. As a separate note, retrieval is most efficient when you provide the entire Primary Key, so we should try to find a solution that does so whenever possible.

I would suggest creating separate tables that store the primary keys based on how recently they were updated. You can segment the data into tables at whatever granularity best fits your use case. For example, say that you want to segment the updates by day (a write-path sketch follows the list below):

a. Your daily updates could be stored in tables with the following naming convention: updates_DDMM

b. The updates_DDMM tables would only have the id's (hash keys of the other table)
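A minimal sketch of the write path under this design: whenever the app writes to the main table (that write is assumed to happen elsewhere), it also logs the item's id in today's updates_DDMM table. The table names are illustrative:

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")

def record_update(item_id, now=None):
    """After writing to the main table, log the id in today's updates table."""
    now = now or datetime.now(timezone.utc)
    updates_table = dynamodb.Table(now.strftime("updates_%d%m"))  # e.g. updates_0504
    updates_table.put_item(Item={"id": item_id})
```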

Now say that the latest app refresh was 2 days ago (04/04/16) and you need to get the recent records; you would then:

i. Scan the tables updates_0504 and updates_0604 to get all the hash keys.

ii. Finally obtain the records from the main table (containing lat/lng, name, etc) by submitting a BatchGetItem with all the obtained hash keys.

BatchGetItem is super fast and will do the job like no other operation.
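A minimal sketch of that two-step refresh, again in Python/boto3, assuming the main table is named "places" (hypothetical) and the daily updates_DDMM tables from above. Note that BatchGetItem accepts at most 100 keys per call, so the ids are chunked:

```python
import boto3
from datetime import datetime, timedelta, timezone

dynamodb = boto3.resource("dynamodb")

def ids_updated_since(last_refresh):
    """Scan each daily updates_DDMM table from last_refresh through today."""
    ids, day = set(), last_refresh.date()
    today = datetime.now(timezone.utc).date()
    while day <= today:
        table = dynamodb.Table(day.strftime("updates_%d%m"))
        resp = table.scan()  # daily tables are tiny; pagination omitted
        ids.update(item["id"] for item in resp["Items"])
        day += timedelta(days=1)
    return ids

def fetch_records(ids):
    """BatchGetItem the full records (name, lat/lng, ...) from the main table."""
    ids, records = list(ids), []
    for i in range(0, len(ids), 100):  # 100-key limit per BatchGetItem call
        resp = dynamodb.batch_get_item(
            RequestItems={"places": {"Keys": [{"id": x} for x in ids[i:i + 100]]}}
        )
        records.extend(resp["Responses"]["places"])
        # retrying resp.get("UnprocessedKeys") is omitted for brevity
    return records
```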

One can argue that creating additional tables adds cost to your overall solution... well, with a GSI you are essentially duplicating your table (if you project all attributes), and you pay that additional cost for all ~2K records, whether they were recently updated or not...

Creating tables like this may seem counterintuitive, but it is actually a best practice when dealing with time-series data (from the AWS DynamoDB documentation):

[...] the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
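For completeness, a sketch of the table rotation that quote describes, with illustrative table names: dropping an expired daily table is a single DeleteTable call rather than one delete per item.

```python
import boto3

client = boto3.client("dynamodb")

def rotate_daily_tables(new_name, expired_name):
    """Create the next day's updates table and drop an expired one wholesale."""
    client.create_table(
        TableName=new_name,  # e.g. "updates_0704" (illustrative)
        AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 5},
    )
    client.delete_table(TableName=expired_name)  # one call, not N item deletes
```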

I hope that helps. Regards.

Answered by b-s-d


While D.Shawley's answer helped point me in the right direction, it missed two considerations for a GSI:

  1. The hash+range need to be unique, yet day+timestamp (his recommended approach) would not necessarily be unique.
  2. By using only the day as the hash, I would need to use a large number of queries to get the results for each day since the last refresh date (which could be months or years ago).

As such, here is the approach I took:

  • Created a Global Secondary Index (GSI) with the hash key as YearMonth (e.g., 201508) and range as id
  • Query the GSI multiple times, one query for each month since the last refresh date. The queries are also filtered with lastModifiedAt > [given timestamp]. (A sketch follows below.)
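A minimal sketch of this approach, assuming the main table is named "places", the GSI is named "YearMonth-id-index", and each item carries a YearMonth attribute derived from lastModifiedAt (all names are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("places")

def changed_since(last_refresh_ts, months):
    """months: YearMonth strings from last refresh to now, e.g. ["201507", "201508"]."""
    items = []
    for ym in months:
        resp = table.query(
            IndexName="YearMonth-id-index",
            KeyConditionExpression=Key("YearMonth").eq(ym),
            FilterExpression=Attr("lastModifiedAt").gt(last_refresh_ts),
        )
        items.extend(resp["Items"])  # pagination omitted for brevity
    return items
```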
Answered by James Skidmore