AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.

Those messages would be later retrieved by the UUID.

In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.

I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somewhat confused...

So should I just create a table with the UUID as the hash key and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and not create any secondary indexes? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.

And if I create the table with the hash and range keys as described, and without a secondary index, then how would I query for all messages that fall in a certain time range, regardless of their UUIDs?

asked Oct 04 '22 by I Z

2 Answers

DynamoDB has since introduced Global Secondary Indexes, which would solve this problem: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
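
As a rough illustration of how a GSI could support the time-based cleanup query while keeping lookups by UUID cheap, here is a minimal boto3 (Python) sketch. The table name "Messages", the index name "byDate", and the messageDate/messageTimestamp attributes are illustrative assumptions, not something defined in the question:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Main table: the UUID alone is the hash key, so GetItem by UUID stays a
# direct lookup. (All names below are hypothetical.)
table = dynamodb.create_table(
    TableName="Messages",
    AttributeDefinitions=[
        {"AttributeName": "messageId", "AttributeType": "S"},
        {"AttributeName": "messageDate", "AttributeType": "S"},
        {"AttributeName": "messageTimestamp", "AttributeType": "N"},
    ],
    KeySchema=[{"AttributeName": "messageId", "KeyType": "HASH"}],
    # GSI partitioned by calendar day and sorted by timestamp, so a Query can
    # find everything older than a cutoff one day at a time.
    GlobalSecondaryIndexes=[
        {
            "IndexName": "byDate",
            "KeySchema": [
                {"AttributeName": "messageDate", "KeyType": "HASH"},
                {"AttributeName": "messageTimestamp", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# Example cleanup query: message ids written on a given day before some cutoff.
resp = table.query(
    IndexName="byDate",
    KeyConditionExpression=Key("messageDate").eq("2013-09-15")
    & Key("messageTimestamp").lt(1379300000),
)
expired_ids = [item["messageId"] for item in resp["Items"]]
```

Note that partitioning the index by day concentrates each day's writes on a single index partition key, which is essentially the "hot key" concern the next answer works around with buckets.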

answered Oct 07 '22 by user3096338

We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time series data. To do this:

1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written with excessive frequency. This can create a serious bottleneck, because the throughput available to a particular DynamoDB partition is the total provisioned throughput divided by the number of partitions. That means if all your writes go to a single key (today's key), you have provisioned 20 writes per second, and the table is spread over 20 partitions, the effective throughput for that key would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.

The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.

2) Use the UUID for the range key

3) Query records by issuing queries for each day and bucket (a sketch follows after this list). This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic MapReduce jobs, but I have not tried that myself yet, so I cannot say how easy or effective it is to work with.
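
A minimal sketch of this bucketed layout, using boto3 (Python). The table name "MessageTimeSeries", the attribute names, and the choice of 200 buckets are illustrative assumptions following the reasoning above, not a definitive implementation:

```python
import random
from datetime import date, timedelta

import boto3
from boto3.dynamodb.conditions import Key

NUM_BUCKETS = 200  # upper bound on partition count, per the reasoning above

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MessageTimeSeries")  # hypothetical second table

def put_message(message_id, message_timestamp, body):
    """Write under a date#bucket hash key so today's writes spread over many partitions."""
    bucket = random.randint(1, NUM_BUCKETS)
    table.put_item(
        Item={
            "dateBucket": f"{date.today().isoformat()}#{bucket}",  # hash key
            "messageId": message_id,                               # range key (the UUID)
            "messageTimestamp": message_timestamp,
            "body": body,
        }
    )

def messages_for_day(day):
    """Fan out one Query per bucket for the given day and merge the results."""
    items = []
    for bucket in range(1, NUM_BUCKETS + 1):
        resp = table.query(
            KeyConditionExpression=Key("dateBucket").eq(f"{day.isoformat()}#{bucket}")
        )
        # (Pagination via LastEvaluatedKey omitted for brevity.)
        items.extend(resp["Items"])
    return items

# Example: collect everything written exactly 7 days ago; the returned messageIds
# can then be deleted from both this table and the main UUID-keyed table.
old_items = messages_for_day(date.today() - timedelta(days=7))
```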

We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo: Falling In and Out Of Love with Dynamo

-John

answered Oct 07 '22 by John Kelvie