Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better method for querying DynamoDB table randomly?

I've included some links along with our approaches to other answers, which seem to be the most optimal on the web right now.

Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).

We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.

We have tried two different approaches, with two different data structures.

Approach 1

AWS DynamoDB - Pick a record/item randomly?

Help selecting nth record in query.

In short, using the category as a hash key, and a UUID as the sort key. Generate a random UUID, query Dynamo using greater than or less than, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around).

Issues with this approach:

  • First query can fail if it is greater than/less than any of the UUIDs
  • Querying on any specific category will cause throttling at scale (Small number of partitions)

We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.

AWS Database Blog Choosing the Right DynamoDB Partition Key

Approach 2

Amazon Web Services: How do we get random item from the dynamoDb's table?

Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.

By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.

Issues with this approach

  • We need a secondary data structure to manage the sequential counts across each category
  • Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.

Conclusion

Both approaches solve our main use case of random queries on category/categories, but the cons they offer are really deterring us from using them. We're leaning more towards approach #1 using suffixes to solve the hot partitioning issue, although we would need the additional retry logic for failed queries.

Is there a better way of approaching this problem? Specifically looking for solutions capable of scaling well (No scan), without requiring extra resources be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a lambda (billed for time used).

Thanks!

like image 787
HMilbradt Avatar asked Jul 11 '18 23:07

HMilbradt


1 Answers

Follow Up

After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases were DynamoDB does not fit, and requires rewriting your use case to fit the DB (Bad).

We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.

For anyone wondering why we chose MySQL, it was largely due to the Nodejs library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazons Aurora database.

We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.

If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!

like image 135
HMilbradt Avatar answered Nov 05 '22 16:11

HMilbradt