
DynamoDB: How to retrieve the first item (by sort key) for each of a given list of partition keys

I have a DynamoDB table that stores historical run data for processes that run on my server. I need a place where I can aggregate these processes and see the data for the latest run of each. Each process has its own ProcessId, which is the partition key for the table. The sort key is the StartDateTime.

{
  ProcessId, // Partition Key
  StartDateTime, // Sort Key
  ... // More data
}

Essentially, I need to retrieve the item with the most recent StartDateTime for each ProcessId that I give. I'm using a Node.js Lambda with the aws-sdk to retrieve the data. I've looked into using BatchGetItem, but my understanding is that for tables with a partition key and a sort key, you need to provide both to retrieve an item. I've also looked into using Query, but I would need to run a separate query for each partition, which is less than ideal. Does anyone know of a way to make this request in one call rather than making a separate call per partition?

Luke asked Jan 10 '20 00:01


1 Answer

To sum up what I understood from your post, you may have data like this in your table:

PK (id)         SK (timestamp)    Other data
process1        1                 ...
process2        4                 ...
process1        8                 ...
process3        18                ...
process2        25                ...

You need to easily retrieve:

process1        8                 ...
process2        25                ...
process3        18                ...

As sandboxbohemian said, I suggest using a DynamoDB stream to trigger a Lambda function each time a new item arrives. However, I would write to the same table and upsert an item with the same id and a timestamp equal to 0. On that item I also set a "latest" attribute that is always "true" (stored as a string, since DynamoDB key attributes must be strings, numbers, or binary values, not Booleans) and a number attribute, timestamp2, holding the timestamp of the newest run. Chronologically, the entries would be:

PK (id)         SK (timestamp)    Other data      timestamp2 (GSI SK)   latest (GSI PK)
process1        1                 ...
process1        0                 ...             1                     true
process2        4                 ...
process2        0                 ...             4                     true
process1        8                 ...
process1        0                 ...             8                     true
process3        18                ...
process3        0                 ...             18                    true
process2        25                ...
process2        0                 ...             25                    true
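
The upsert described above can be sketched as a small helper that builds the put parameters for the "latest" pointer item. This is only a sketch under assumptions: the table name "ProcessRuns", the helper names, and the condition expression that protects against out-of-order stream records are not part of the original answer, and "latest" is assumed to be stored as the string "true".

```javascript
// Hypothetical table name; attribute names follow the tables above.
const TABLE_NAME = "ProcessRuns";

// Pointer items are written to the same table, so the stream also delivers them.
// Skip them, otherwise the Lambda would keep reacting to its own writes.
function isPointerItem(run) {
  return run.timestamp === 0;
}

// Build DocumentClient `put` parameters for the per-process pointer item.
function buildLatestPointerPut(run) {
  const { id, timestamp, ...otherData } = run;
  return {
    TableName: TABLE_NAME,
    Item: {
      ...otherData,
      id,
      timestamp: 0,          // fixed sort key, so every put overwrites the same item
      timestamp2: timestamp, // GSI sort key: timestamp of the newest run
      latest: "true",        // GSI partition key, stored as a string (assumption)
    },
    // Assumption: only replace the pointer if it is missing or older than this run.
    ConditionExpression: "attribute_not_exists(#ts2) OR #ts2 < :ts",
    ExpressionAttributeNames: { "#ts2": "timestamp2" },
    ExpressionAttributeValues: { ":ts": timestamp },
  };
}
```

In the stream-triggered Lambda, each new run that is not itself a pointer item would be passed through buildLatestPointerPut, and the result handed to the DocumentClient put call. When the condition fails because a newer pointer already exists, DynamoDB rejects the put with a conditional check failure, which the handler can safely ignore.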

Then create a GSI whose partition key is "latest" and whose sort key is "timestamp2", and project the "id" and "data" attributes. It is a sparse index, meaning that only items with the latest attribute set appear in it. Its content is:

latest (GSI PK) timestamp2 (GSI SK)   id        timestamp   Data
true            8                     process1  0           ...
true            25                    process2  0           ...    
true            18                    process3  0           ...   
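
One possible way to add such an index to an existing table is through the UpdateTable operation. The sketch below only builds the request parameters. The table name, the index name "LatestIndex", the string type for "latest", and the projected attribute name "data" are assumptions made for illustration.

```javascript
// Build UpdateTable parameters that add the sparse GSI (names are hypothetical).
function buildCreateLatestIndexParams(tableName, indexName) {
  return {
    TableName: tableName,
    // Only the attributes used as index keys have to be declared.
    AttributeDefinitions: [
      { AttributeName: "latest", AttributeType: "S" },
      { AttributeName: "timestamp2", AttributeType: "N" },
    ],
    GlobalSecondaryIndexUpdates: [
      {
        Create: {
          IndexName: indexName,
          KeySchema: [
            { AttributeName: "latest", KeyType: "HASH" },      // GSI partition key
            { AttributeName: "timestamp2", KeyType: "RANGE" }, // GSI sort key
          ],
          // The table's own keys (id, timestamp) are always projected into the index,
          // so only the extra attribute is listed here.
          Projection: { ProjectionType: "INCLUDE", NonKeyAttributes: ["data"] },
          // A table in provisioned capacity mode also needs ProvisionedThroughput here.
        },
      },
    ],
  };
}
```

These parameters would be passed to the updateTable call of the low-level DynamoDB client, for example buildCreateLatestIndexParams("ProcessRuns", "LatestIndex").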

As you can see, the GSI partition key always has the same value, so you can read the index with either a query or a scan. If you need the latest run of every process, a scan of the index is enough. If the number of processes is really high, you can query with latest = "true" and take advantage of the sort order on timestamp2.
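
A query against the index might look like the sketch below. The table and index names are hypothetical, "latest" is assumed to be stored as the string "true", and the client-side filter for a given list of process IDs is an addition for illustration, not part of the answer.

```javascript
// Build DocumentClient `query` parameters for the sparse "latest" index.
function buildLatestRunsQuery(tableName, indexName, limit) {
  const params = {
    TableName: tableName,
    IndexName: indexName,
    KeyConditionExpression: "#latest = :latest",
    ExpressionAttributeNames: { "#latest": "latest" },
    ExpressionAttributeValues: { ":latest": "true" },
    ScanIndexForward: false, // descending timestamp2, so the newest runs come first
  };
  if (limit !== undefined) {
    params.Limit = limit; // optional cap on the number of items read
  }
  return params;
}

// Keep only the pointer items for the process IDs the caller asked about.
function pickProcesses(items, processIds) {
  const wanted = new Set(processIds);
  return items.filter((item) => wanted.has(item.id));
}
```

The parameters would go to the DocumentClient query call, and the returned Items can then be narrowed with pickProcesses(items, ["process1", "process3"]). A single Query response is capped at 1 MB of data, so a large index may need several calls that follow LastEvaluatedKey.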

I agree this schema is not intuitive, but that is often the case with DynamoDB.

ben11 answered Oct 05 '22 20:10
