DynamoDB: Best hash/sort keys for my use case [confusion with AppSync/GraphQL]

I plan on using AWS Cognito for user auth, DynamoDB for persistence and AppSync (and a lot of Mobile Hub) to power the API - a Book Review site.

I'm having a hard time determining which field should be my hash key and which should be my sort key, and which LSI/GSI I should create.

I have a list of Books with details like so:

type Book {
  isbn: Int!
  year: Int!
  title: String!
  description: String
  front_cover_photo_url: String
  genre_ids: [Int]
  count_thumbs: Int
  us_release_date: String
  upcoming_release: Boolean
  currently_featured_in_book_stores: Boolean
  best_seller: Boolean
  reviews: [Review]
}

I also have a review record each time a user writes a review about a book.

type Review {
  isbn: Int!
  id: ID!
  created_at: String!

  # The user that submitted the review
  user_id: String!

  # The number of thumbs out of 5
  thumbs: Int!

  # Comments on the review
  comments: String!
}

Books, in my case, can have multiple genres, e.g. "Fantasy" and "Drama". Books also have reviews by Users, whose data is stored in Cognito. We will display the reviews in reverse chronological order next to every book.

QUESTION 1: If I denormalize and use Drama as a genre instead of Genre ID 2, then what if I need to rename the genre later to Dramatic... wouldn't I need to update every item?

I need to be able to answer, at a minimum:

  • Get all books currently featured in book stores [currently_featured_in_book_stores == True]
  • Get all books that are "upcoming" [upcoming_release == True]
  • Get all books sorted by most thumbs [sort by count_thumbs DESC]
  • Get all books that are in genre "Comedy" [genre_ids contains 123 or "Comedy" depending on answer to Q1]
  • Query for book(s) named "Harry Potter" [title LIKE '%Harry Potter%']
  • Get all books with ISBN 1, 2, 3, 4, or 9 [ isbn IN [1,2,3,4,9] ]

QUESTION 2: What's the best way to structure the book data in DynamoDB, and which hash/sort/LSI/GSI would you use?

Since I'm using Cognito, the user profile data is stored outside of DynamoDB.

QUESTION 3: Should I have a User table in DynamoDB and dual write new registrations, so I can use AppSync to populate the review's details when showing their review? If not, how would I get the user's username/first name/last name when populating the book review details?

QUESTION 4: Since we've gone this far, any suggestions for the graphql schema?

asked May 06 '18 06:05 by Z Jones



1 Answer

I would encourage you to read this answer, which I wrote previously to provide some general background on choosing keys. You should also open the links from that answer, which cover most of the key information AWS makes available on the subject.

Before providing an answer I think I should also give the caveat that data architecture typically takes into account lots of factors. You've put some really good information in the question but inevitably there is not enough to provide a definitive 'best' solution. And indeed even with more information you would get different opinions.

Question 2

That said, here is what I would be thinking about doing in your case. I would be looking at creating a table called Books and a table called BookReviews.

Table: Books
Partition Key: ISBN

Table: BookReviews
Partition Key: ISBN
Sort Key: BookReview-id

I would not be looking to create any GSIs or LSIs.
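
With this layout every review for a book lives in that book's partition, so one query on BookReviews fetches them all. Here is a minimal sketch of that, assuming the key attribute is named ISBN and stored as a string, and that created_at holds an ISO 8601 timestamp. The helper names reviewsQueryParams and newestFirst are made up for illustration. Because the sort key is the review id rather than the date, the reverse chronological ordering is done client side.

```javascript
// Build Query parameters for every review of one book (low-level attribute format).
function reviewsQueryParams(isbn) {
  return {
    TableName: "BookReviews",
    KeyConditionExpression: "ISBN = :isbn",
    ExpressionAttributeValues: { ":isbn": { S: String(isbn) } }
  };
}

// Return a copy of the review items sorted newest first by created_at.
// ISO 8601 strings in the same time zone sort chronologically as plain strings.
function newestFirst(items) {
  return items.slice().sort(function (a, b) {
    if (a.created_at.S < b.created_at.S) return 1;
    if (a.created_at.S > b.created_at.S) return -1;
    return 0;
  });
}

// Usage, given a configured client:
// dynamodb.query(reviewsQueryParams("1"), function (err, data) {
//   if (!err) console.log(newestFirst(data.Items));
// });
```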

Most of your queries involve finding 'all books' and ordering them in some way. These lists do not sound time sensitive. For example when a user asks for the most popular 100 books do they need to know the most popular books, including every vote counted up until the last second? I doubt it. Additionally are these lists specific to individual users? It doesn't sound like it.

My general tip is this: store your raw data in DynamoDB and update it in real time. Build your common lists of books, refresh them once in a while (perhaps daily), and store these lists in a cache. Optionally you could also store these lists in separate DynamoDB tables and query them in the event your cache is destroyed.
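
The "refresh once in a while" idea can be sketched as a small in-memory cache with a time-to-live. This is only an illustration, not a recommendation of a particular cache: createListCache is a hypothetical helper, and the load function stands in for whatever scan builds the list.

```javascript
// A tiny in-memory cache: load() is only called when the stored value
// is missing or older than ttlMs.
function createListCache(ttlMs, now) {
  now = now || Date.now;   // injectable clock, handy for testing
  var entries = {};        // list name -> { value, storedAt }
  return {
    get: function (name, load, callback) {
      var hit = entries[name];
      if (hit && now() - hit.storedAt < ttlMs) {
        return callback(null, hit.value);
      }
      load(function (err, value) {
        if (err) return callback(err);
        entries[name] = { value: value, storedAt: now() };
        callback(null, value);
      });
    }
  };
}

// Usage: refresh the "featured" list at most once a day.
// var cache = createListCache(24 * 60 * 60 * 1000);
// cache.get("featured", loadFeaturedBooks, function (err, books) { ... });
```

Here loadFeaturedBooks would be your own function that runs the scan and hands the items to its callback.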

Get all books currently featured in book stores

 var params = {
   TableName: "Books",
   ExpressionAttributeValues: {
     ":a": { BOOL: true }
   },
   FilterExpression: "currently_featured_in_book_stores = :a"
 };
 dynamodb.scan(params, function(err, data) {
   if (err) console.log(err, err.stack); // an error occurred
   else     console.log(data);           // successful response
 });

This operation will retrieve all books that are currently featured in book stores. It uses a scan. If you are not already familiar with scan, query and getItem, you should definitely spend some time reading about them.

A scan evaluates every item in a table. For this reason scans sometimes don't scale well on large tables and can be expensive if you are only retrieving a few items. A query uses the partition key to return a set of items and is therefore typically fast and efficient. You can use a sort key in a query to quickly return a range of items from within a partition. GetItem uses the unique primary key and is very efficient.
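
One practical detail when scanning: a single Scan call returns at most 1 MB of data, and sets LastEvaluatedKey in the response when more remains. A minimal sketch of following those pages is below. scanAll is a hypothetical helper, and client is assumed to be anything with a scan(params, callback) method, such as the dynamodb object used in these examples.

```javascript
// Run a Scan to completion, following LastEvaluatedKey until every page is read.
function scanAll(client, params, callback, collected) {
  collected = collected || [];
  client.scan(params, function (err, data) {
    if (err) return callback(err);
    collected = collected.concat(data.Items || []);
    if (!data.LastEvaluatedKey) return callback(null, collected);
    // More pages remain: resume from where the previous page stopped.
    var next = Object.assign({}, params, {
      ExclusiveStartKey: data.LastEvaluatedKey
    });
    scanAll(client, next, callback, collected);
  });
}

// Usage, given a configured client:
// scanAll(dynamodb, { TableName: "Books" }, function (err, items) {
//   if (err) console.log(err, err.stack);
//   else     console.log(items.length);
// });
```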

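Read cost depends on how much data is read, not on how many items come back, because a filter is applied after the items have been read. As a rough illustration, assume each item is about 4 KB and reads are strongly consistent, so reading one item costs one RCU. If your table had 100 items, ANY scan you perform would cost about 100 RCUs. If you perform a query, and only 2 items are in the queried partition, it would cost you about 2 RCUs.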
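
The item counts above turn into capacity units through item size: one RCU covers a strongly consistent read of up to 4 KB, and an eventually consistent read costs half as much. A rough estimator under those rules is sketched below. estimateReadUnits is a hypothetical helper for back-of-envelope planning, not an AWS API.

```javascript
var RCU_BYTES = 4096; // one read capacity unit covers up to 4 KB, strongly consistent

// Estimate read units for a Scan or Query that reads `totalBytes` of item data.
// The data read is summed first and then rounded up to the next 4 KB.
function estimateReadUnits(totalBytes, stronglyConsistent) {
  var units = Math.ceil(totalBytes / RCU_BYTES);
  return stronglyConsistent ? units : units / 2;
}

console.log(estimateReadUnits(100 * 4096, true));  // scan of 100 items of 4 KB: 100
console.log(estimateReadUnits(2 * 4096, true));    // query reading 2 such items: 2
console.log(estimateReadUnits(100 * 400, true));   // scan of 100 items of 400 bytes: 10
console.log(estimateReadUnits(100 * 4096, false)); // eventually consistent scan: 50
```

So for small items a full scan of a small table can be much cheaper than the item count suggests.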

If a significant proportion of items in the Books table have currently_featured_in_book_stores=true, I would do a scan. If only a small number of items in the table have currently_featured_in_book_stores=true AND this is a very frequent query, you could consider creating a GSI on the Books table with partition key of currently_featured_in_book_stores and sort key of ISBN.

Imagine your Books table has 100 books, and 50 have currently_featured_in_book_stores=true. Doing a scan costs roughly 100 RCUs (at about one RCU per item), which is not much more than a query would cost. Now imagine only one book has currently_featured_in_book_stores=true: performing a scan would still cost about 100 RCUs, but a query would only cost 1 RCU. However, you should think hard before adding GSIs. They do not share throughput with the base table, so you have to purchase RCUs separately for your GSI. If you under-provision a GSI, it can end up being slower than a scan on a well-provisioned base table.

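A boolean value is a bad partition key and I would go for a scan here. Note too that DynamoDB only accepts String, Number or Binary attributes as index keys, so to build this GSI the flag would have to be stored as, say, the string "true" rather than as a BOOL. That said, if you created such a GSI your query would look like this:

 var params = {
   TableName: "Books",
   IndexName: "Index_Books_In_Stores",
   ExpressionAttributeValues: {
     // Key attributes cannot be BOOL, so the flag is assumed to be stored as a string here
     ":v1": { S: "true" }
   },
   KeyConditionExpression: "currently_featured_in_book_stores = :v1"
 };
 dynamodb.query(params, function(err, data) {
   if (err) console.log(err, err.stack); // an error occurred
   else     console.log(data);           // successful response
 });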

Get all books that are upcoming

All of the above still applies. I would do a scan like this:

 var params = {
   TableName: "Books",
   ExpressionAttributeValues: {
     ":a": { BOOL: true }
   },
   FilterExpression: "upcoming_release = :a"
 };
 dynamodb.scan(params, function(err, data) {
   if (err) console.log(err, err.stack); // an error occurred
   else     console.log(data);           // successful response
 });

I would do this scan infrequently and cache the results in a temporary store (e.g. in application memory).

Get all books sorted by most thumbs

The important thing here is the 'Get all books...'. That tells you right away that a scan is probably going to be the best approach. You can think of a query as a scan that only looks at one partition. You don't want to look at a partition of books, you want ALL the books, so a scan is the way to go.

The only way DynamoDB will return sorted items is if you perform a query on a table or index that has a sort key. In this case the items would automatically be returned in sorted order based on the sort key. So for this search, you just need to do a scan to get all the books, and then sort them by your chosen attribute (thumbs) client side. The scan simply returns all books and looks like this.

 var params = {
  TableName: "Books"
 };
 dynamodb.scan(params, function(err, data) {
   if (err) console.log(err, err.stack); // an error occurred
   else     console.log(data);           // successful response
 });

Again, I would do this scan very infrequently and cache the top books. You can order your cache and just retrieve the number of items you need, perhaps the top 10, 100 or 1000. If the user carried on paging beyond the scope of the cache, you might need to do a new scan. I think more likely you would just limit the number of items and stop the user paging any further.
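
The client-side sort and the "top N" cut can be sketched as follows. The items are assumed to be in the low-level format the scan above returns, with count_thumbs stored as a Number attribute. topBooksByThumbs is a hypothetical helper name.

```javascript
// Return the `limit` books with the most thumbs, highest first.
// Books without a count_thumbs attribute are treated as having 0.
function topBooksByThumbs(items, limit) {
  function thumbs(item) {
    return item.count_thumbs ? Number(item.count_thumbs.N) : 0;
  }
  return items
    .slice() // copy, so the scanned list itself is left untouched
    .sort(function (a, b) { return thumbs(b) - thumbs(a); })
    .slice(0, limit);
}

// Usage inside the scan callback:
// var top100 = topBooksByThumbs(data.Items, 100);
```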

Get all books that are in genre "Comedy"

Again, most likely I would do a scan infrequently and cache the list. You could consider adding a GSI with partition key genre and sort key ISBN. Personally I would start with the scan and cache approach and see how you get on. You can always add the GSI at a later date.
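
For the scan approach, the filter could use the contains function, which matches an element of a list or set. A sketch, assuming genre_ids is stored as a list (or number set) of numeric genre ids as in the question's schema; genreScanParams is a hypothetical helper name.

```javascript
// Build Scan parameters that keep only books tagged with the given genre id.
function genreScanParams(genreId) {
  return {
    TableName: "Books",
    // contains() is true when the list or set holds an element equal to :g
    FilterExpression: "contains(genre_ids, :g)",
    ExpressionAttributeValues: { ":g": { N: String(genreId) } }
  };
}

// Usage, given a configured client:
// dynamodb.scan(genreScanParams(123), function (err, data) { ... });
```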

Query for book(s) named "Harry Potter"

Clearly you can't cache this one. Do a scan with a FilterExpression on title. Note that contains is a function in expression syntax, and the match is case sensitive.

 var params = {
   TableName: "Books",
   ExpressionAttributeValues: {
     ":a": { S: "Harry Potter" }
   },
   // contains(path, operand) is true when title includes the substring
   FilterExpression: "contains(title, :a)"
 };
 dynamodb.scan(params, function(err, data) {
   if (err) console.log(err, err.stack); // an error occurred
   else     console.log(data);           // successful response
 });

You can check out the condition operators here.

Get all books with ISBN 1, 2, 3, 4, or 9

For this one, do a GetItem on each individual ISBN and add the result to a set. The request below gets one book. You would put this in a loop and iterate through the set of ISBNs you want to get.

 var params = {
  Key: {
   "ISBN": {
     S: "1"
    }
  }, 
  TableName: "Books"
 };
 dynamodb.getItem(params, function(err, data) {
   if (err) console.log(err, err.stack); // an error occurred
   else     console.log(data);           // successful response
 });
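
As an alternative to the loop, BatchGetItem can fetch up to 100 keys in one call. A sketch of building those requests is below, assuming the same ISBN string key as the GetItem example; batchGetRequests is a hypothetical helper name. The list of ISBNs should not contain duplicates, and a response can include UnprocessedKeys that need to be requested again.

```javascript
var MAX_KEYS_PER_BATCH = 100; // BatchGetItem accepts at most 100 keys per call

// Split a list of ISBNs into BatchGetItem request parameters.
function batchGetRequests(isbns) {
  var requests = [];
  for (var i = 0; i < isbns.length; i += MAX_KEYS_PER_BATCH) {
    var keys = isbns.slice(i, i + MAX_KEYS_PER_BATCH).map(function (isbn) {
      return { ISBN: { S: String(isbn) } };
    });
    requests.push({ RequestItems: { Books: { Keys: keys } } });
  }
  return requests;
}

// Usage, given a configured client:
// batchGetRequests([1, 2, 3, 4, 9]).forEach(function (params) {
//   dynamodb.batchGetItem(params, function (err, data) {
//     if (!err) console.log(data.Responses.Books);
//   });
// });
```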

Question 1

Yes, if you store the genre as a string against each item, and you change the genre name, you would have to update each item. Or as an alternative you would have to update the genre on the item before presenting it to the user.

If you expect to change genre names, the idea of using genre_id mappings seems like a good one. Just have a table of genre names and ids, load it when your application starts and keep it in application memory. You might need an admin function to reload the genre mappings table.
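
The in-memory mapping could look something like this. The Genres table layout (an id Number and a name String per item) and the helper names buildGenreLookup and genreNames are assumptions for illustration. genre_ids is assumed to be stored as a list of numbers.

```javascript
// Turn the scanned Genres items into an id -> name lookup kept in memory.
function buildGenreLookup(genreItems) {
  var byId = {};
  genreItems.forEach(function (item) {
    byId[item.id.N] = item.name.S;
  });
  return byId;
}

// Resolve a book's genre ids to display names just before showing it.
function genreNames(book, byId) {
  var ids = book.genre_ids ? book.genre_ids.L : [];
  return ids.map(function (g) { return byId[g.N] || "Unknown"; });
}

// Renaming "Drama" to "Dramatic" then means updating one Genres item
// and reloading the lookup; no Book items change.
```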

Keeping application parameters in a database is a well-used design.

Question 3

Absolutely, have a User table in DynamoDB. That's the way I do it in my application which uses Cognito. I store a minimum set of fields in Cognito relating to user registration, then I have lots of application specific data in DynamoDB in a user table.

Question 4

Regarding GraphQL schemas, I would check out this article by AWS. I'm not sure how much help it will be.

answered Oct 15 '22 10:10 by F_SO_K