Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Querying DynamoDB by date

People also ask

How does DynamoDB store dates?

Original Answer: You can do this now in DynamoDB by using GSI. Make the "CreatedAt" field as a GSI and issue queries like (GT some_date). Store the date as a number (msecs since epoch) for this kind of queries.

How do I sort DynamoDB query results?

Query results are always sorted by the sort key value. If the data type of the sort key is Number, the results are returned in numeric order; otherwise, the results are returned in order of UTF-8 bytes. By default, the sort order is ascending. To reverse the order, set the ScanIndexForward parameter to false.

Is DynamoDB good for querying?

Querying is a very powerful operation in DynamoDB. It allows you to select multiple Items that have the same partition ("HASH") key but different sort ("RANGE") keys.


Given your current table structure this is not currently possible in DynamoDB. The huge challenge is to understand that the Hash key of the table (partition) should be treated as creating separate tables. In some ways this is really powerful (think of partition keys as creating a new table for each user or customer, etc...).

Queries can only be done in a single partition. That's really the end of the story. This means if you want to query by date (you'll want to use msec since epoch), then all the items you want to retrieve in a single query must have the same Hash (partition key).

I should qualify this. You absolutely can scan by the criterion you are looking for, that's no problem, but that means you will be looking at every single row in your table, and then checking if that row has a date that matches your parameters. This is really expensive, especially if you are in the business of storing events by date in the first place (i.e. you have a lot of rows.)

You may be tempted to put all the data in a single partition to solve the problem, and you absolutely can, however your throughput will be painfully low, given that each partition only receives a fraction of the total set amount.

The best thing to do is determine more useful partitions to create to save the data:

  • Do you really need to look at all the rows, or is it only the rows by a specific user?

  • Would it be okay to first narrow down the list by Month, and do multiple queries (one for each month)? Or by Year?

  • If you are doing time series analysis there are a couple of options, change the partition key to something computated on PUT to make the query easier, or use another aws product like kinesis which lends itself to append-only logging.


Updated Answer:

DynamoDB allows for specification of secondary indexes to aid in this sort of query. Secondary indexes can either be global, meaning that the index spans the whole table across hash keys, or local meaning that the index would exist within each hash key partition, thus requiring the hash key to also be specified when making the query.

For the use case in this question, you would want to use a global secondary index on the "CreatedAt" field.

For more on DynamoDB secondary indexes see the secondary index documentation

Original Answer:

DynamoDB does not allow indexed lookups on the range key only. The hash key is required such that the service knows which partition to look in to find the data.

You can of course perform a scan operation to filter by the date value, however this would require a full table scan, so it is not ideal.

If you need to perform an indexed lookup of records by time across multiple primary keys, DynamoDB might not be the ideal service for you to use, or you might need to utilize a separate table (either in DynamoDB or a relational store) to store item metadata that you can perform an indexed lookup against.


Approach I followed to solve this problem is by created a Global Secondary Index as below. Not sure if this is the best approach but hopefully if it is useful to someone.

Hash Key                 | Range Key
------------------------------------
Date value of CreatedAt  | CreatedAt

Limitation imposed on the HTTP API user to specify the number of days to retrieve data, defaulted to 24 hr.

This way, I can always specify the HashKey as Current date's day and RangeKey can use > and < operators while retrieving. This way the data is also spread across multiple shards.


Your Hash key (primary of sort) has to be unique (unless you have a range like stated by others).

In your case, to query your table you should have a secondary index.

|  ID  | DataID | Created | Data |
|------+--------+---------+------|
| hash | xxxxx  | 1234567 | blah |

Your Hash Key is ID Your secondary index is defined as: DataID-Created-index (that's the name that DynamoDB will use)

Then, you can make a query like this:

var params = {
    TableName: "Table",
    IndexName: "DataID-Created-index",
    KeyConditionExpression: "DataID = :v_ID AND Created > :v_created",
    ExpressionAttributeValues: {":v_ID": {S: "some_id"},
                                ":v_created": {N: "timestamp"}
    },
    ProjectionExpression: "ID, DataID, Created, Data"
};

ddb.query(params, function(err, data) {
    if (err) 
        console.log(err);
    else {
        data.Items.sort(function(a, b) {
            return parseFloat(a.Created.N) - parseFloat(b.Created.N);
        });
        // More code here
    }
});

Essentially your query looks like:

SELECT * FROM TABLE WHERE DataID = "some_id" AND Created > timestamp;

The secondary Index will increase the read/write capacity units required so you need to consider that. It still is a lot better than doing a scan, which will be costly in reads and in time (and is limited to 100 items I believe).

This may not be the best way of doing it but for someone used to RD (I'm also used to SQL) it's the fastest way to get productive. Since there is no constraints in regards to schema, you can whip up something that works and once you have the bandwidth to work on the most efficient way, you can change things around.