How to query DynamoDB by date (range key), with no obvious hash key?

I need to keep local data on an iOS app in sync with data in a DynamoDB table. The DynamoDB table is ~2K rows, with only a hash key (id), and the following attributes:

  • id (uuid)
  • lastModifiedAt (timestamp)
  • name
  • latitude
  • longitude

I am currently scanning and filtering by lastModifiedAt, where lastModifiedAt is greater than the app's last refresh date, but I imagine that will become expensive.
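For illustration, here is roughly what that Scan-and-filter refresh looks like in Python/boto3 (the question concerns an iOS app, but boto3 is used here as a neutral sketch language, and the table name "places" is hypothetical). A Scan consumes read capacity for every item it examines, and the FilterExpression is applied only after the read, which is why this approach gets expensive:

```python
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("places")  # hypothetical table name

def changed_since(last_refresh_ts):
    """Return all items with lastModifiedAt greater than the given timestamp."""
    items = []
    kwargs = {"FilterExpression": Attr("lastModifiedAt").gt(last_refresh_ts)}
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:  # no more pages
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```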

The best answer I can find is to add a Global Secondary Index with lastModifiedAt as the range key, but there is no obvious hash key for the GSI.

What is best practice when needing to query by range using a GSI, but there is no obvious hash key? Alternatively, if a full scan is the only option, are there any best practices to keep down the cost?

Asked Mar 12 '16 by James Skidmore



2 Answers

Although a Global Secondary Index seems to fit your requirements, any attempt to include timestamp-related information as part of your Hash Key will most likely create what is known as a "hot partition", which is extremely undesirable.

The uneven access occurs because the most recent items are retrieved far more frequently than older ones. This not only hurts performance but also makes your solution less cost-effective.

See some details from the documentation:

For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.

Based on what is stated, id does indeed seem to be a good choice for your Hash Key (a.k.a. Partition Key), and I wouldn't change that, since GSI keys are partitioned the same way. As a separate note, retrieval is most efficient when you provide the entire Primary Key, so we should try to find a solution that does so whenever possible.

I would suggest creating separate tables that store the primary keys based on how recently they were updated. You can segment the data into tables at whatever granularity best fits your use case. For example, say that you want to segment the updates by day (a write-path sketch follows the list below):

a. Your daily updates could be stored in tables with the following naming convention: updates_DDMM

b. The updates_DDMM tables would only have the id's (hash keys of the other table)
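A minimal sketch of the write path under this design: whenever the app writes to the main table (that write is assumed to happen elsewhere), it also logs the item's id in today's updates_DDMM table. The table names are illustrative:

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")

def record_update(item_id, now=None):
    """After writing to the main table, log the id in today's updates table."""
    now = now or datetime.now(timezone.utc)
    updates_table = dynamodb.Table(now.strftime("updates_%d%m"))  # e.g. updates_0504
    updates_table.put_item(Item={"id": item_id})
```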

Now say that the latest app refresh was 2 days ago (04/04/16) and you need to get the recent records; you would then:

i. Scan the tables updates_0504 and updates_0604 to get all the hash keys.

ii. Finally obtain the records from the main table (containing lat/lng, name, etc) by submitting a BatchGetItem with all the obtained hash keys.

BatchGetItem is super fast and will do the job like no other operation.
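A minimal sketch of that two-step refresh, again in Python/boto3, assuming the main table is named "places" (hypothetical) and the daily updates_DDMM tables from above. Note that BatchGetItem accepts at most 100 keys per call, so the ids are chunked:

```python
import boto3
from datetime import datetime, timedelta, timezone

dynamodb = boto3.resource("dynamodb")

def ids_updated_since(last_refresh):
    """Scan each daily updates_DDMM table from last_refresh through today."""
    ids, day = set(), last_refresh.date()
    today = datetime.now(timezone.utc).date()
    while day <= today:
        table = dynamodb.Table(day.strftime("updates_%d%m"))
        resp = table.scan()  # daily tables are tiny; pagination omitted
        ids.update(item["id"] for item in resp["Items"])
        day += timedelta(days=1)
    return ids

def fetch_records(ids):
    """BatchGetItem the full records (name, lat/lng, ...) from the main table."""
    ids, records = list(ids), []
    for i in range(0, len(ids), 100):  # 100-key limit per BatchGetItem call
        resp = dynamodb.batch_get_item(
            RequestItems={"places": {"Keys": [{"id": x} for x in ids[i:i + 100]]}}
        )
        records.extend(resp["Responses"]["places"])
        # retrying resp.get("UnprocessedKeys") is omitted for brevity
    return records
```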

One can argue that creating additional tables adds cost to your overall solution... well, with a GSI you are essentially duplicating your table (if you project all attributes), and you pay that additional cost for all ~2K records, whether they were recently updated or not...

Creating tables like this may seem counterintuitive, but it is actually a best practice when dealing with time-series data (from the AWS DynamoDB documentation):

[...] the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
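For completeness, a sketch of the table rotation that quote describes, with illustrative table names: dropping an expired daily table is a single DeleteTable call rather than one delete per item.

```python
import boto3

client = boto3.client("dynamodb")

def rotate_daily_tables(new_name, expired_name):
    """Create the next day's updates table and drop an expired one wholesale."""
    client.create_table(
        TableName=new_name,  # e.g. "updates_0704" (illustrative)
        AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 5},
    )
    client.delete_table(TableName=expired_name)  # one call, not N item deletes
```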

I hope that helps. Regards.

Answered by b-s-d


While D.Shawley's answer helped point me in the right direction, it missed two considerations for a GSI:

  1. The hash+range need to be unique, yet day+timestamp (his recommended approach) would not necessarily be unique.
  2. By using only the day as the hash, I would need to use a large number of queries to get the results for each day since the last refresh date (which could be months or years ago).

As such, here is the approach I took:

  • Created a Global Secondary Index (GSI) with the hash key as YearMonth (e.g., 201508) and range as id
  • Query the GSI multiple times, one query for each month since the last refresh date. The queries are also filtered with lastModifiedAt > [given timestamp]. (A sketch follows below.)
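A minimal sketch of this approach, assuming the main table is named "places", the GSI is named "YearMonth-id-index", and each item carries a YearMonth attribute derived from lastModifiedAt (all names are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("places")

def changed_since(last_refresh_ts, months):
    """months: YearMonth strings from last refresh to now, e.g. ["201507", "201508"]."""
    items = []
    for ym in months:
        resp = table.query(
            IndexName="YearMonth-id-index",
            KeyConditionExpression=Key("YearMonth").eq(ym),
            FilterExpression=Attr("lastModifiedAt").gt(last_refresh_ts),
        )
        items.extend(resp["Items"])  # pagination omitted for brevity
    return items
```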
Answered by James Skidmore