I plan on using AWS Cognito for user auth, DynamoDB for persistence and AppSync (and a lot of Mobile Hub) to power the API - a Book Review site.
I'm having a hard time determining which field should be my hash key and which should be my sort key, and which LSI/GSI I should create.
I have a list of Books with details like so:
type Book {
isbn: Int!
year: Int!
title: String!
description: String
front_cover_photo_url: String
genre_ids: [Int]
count_thumbs: Int
us_release_date: String
upcoming_release: Boolean
currently_featured_in_book_stores: Boolean
best_seller: Boolean
reviews: [Review]
}
I also have a review record each time a user writes a review about a book.
type Review {
isbn: Int!
id: ID!
created_at: String!
# The user that submitted the review
user_id: String!
# The number of thumbs out of 5
thumbs: Int!
# Comments on the review
comments: String!
}
Books, in my case, can have multiple genres, e.g. "Fantasy" and "Drama". Books also have reviews by Users, whose data is stored in Cognito. We will display the reviews in reverse chronological order next to every book.
QUESTION 1: If I denormalize and use Drama as a genre instead of Genre ID 2, then what if I need to rename the genre later to Dramatic... wouldn't I need to update every item?
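I need to be able to answer, at a minimum:
Get all books currently featured in book stores [currently_featured_in_book_stores == True]
Get all books that are upcoming [upcoming_release == True]
Get all books sorted by most thumbs [count_thumbs DESC]
Get all books that are in genre "Comedy" [genre_ids contains 123 or "Comedy" depending on answer to Q1]
Query for book(s) named "Harry Potter" [title LIKE '%Harry Potter%']
Get all books with ISBN 1, 2, 3, 4, or 9 [isbn IN [1,2,3,4,9]]
QUESTION 2: What's the best way to structure the book data in DynamoDB, and which hash/sort/LSI/GSI would you use?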
Since I'm using Cognito, the user profile data is stored outside of DynamoDB.
QUESTION 3: Should I have a User table in DynamoDB and dual write new registrations, so I can use AppSync to populate the review's details when showing their review? If not, how would I get the user's username/first name/last name when populating the book review details?
QUESTION 4: Since we've gone this far, any suggestions for the graphql schema?
You cannot query using only a sort key. You need to specify a partition key to perform query operations. Otherwise, you need to create a global secondary index or perform a scan operation.
In a DynamoDB table, the combined partition key value and sort key value for each item must be unique. However, in a local secondary index, the sort key value does not need to be unique for a given partition key value.
In an Amazon DynamoDB table, the primary key that uniquely identifies each item in the table can be composed not only of a partition key, but also of a sort key. Well-designed sort keys have two key benefits: They gather related information together in one place where it can be queried efficiently.
Generally in DynamoDB you can create Local Secondary Indexes if you need alternative sort key: To give your application a choice of sort keys, you can create one or more local secondary indexes on an Amazon DynamoDB table and issue Query or Scan requests against these indexes.
I would encourage you to read this answer, which I previously wrote to provide some general background on choosing keys. You should also open the links from that answer, which provide most of the key information AWS makes available on the subject.
Before providing an answer I think I should also give the caveat that data architecture typically takes into account lots of factors. You've put some really good information in the question but inevitably there is not enough to provide a definitive 'best' solution. And indeed even with more information you would get different opinions.
That said, here is what I would be thinking about doing in your case. I would be looking at creating a table called Books and a table called BookReviews.
Table: Books
Partition Key: ISBN
Table: BookReviews
Partition Key: ISBN
Sort Key: BookReview-id
I would not be looking to create any GSIs or LSIs.
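For the reviews shown next to each book, a query on the BookReviews partition is enough. Here is a minimal sketch of the parameters, assuming the ISBN is stored as a string as in the GetItem example further down. The function name is just illustrative. Note that DynamoDB returns the items in sort key order, so they only come back newest first if your BookReview-id values happen to sort chronologically (for example, ids that start with a timestamp), which is an assumption on my part.

```javascript
// Build Query parameters for the reviews of one book.
// Assumption: partition key "ISBN" is a string attribute.
function reviewsForBookParams(isbn, limit) {
  return {
    TableName: "BookReviews",
    ExpressionAttributeValues: {
      ":isbn": { S: String(isbn) }
    },
    KeyConditionExpression: "ISBN = :isbn",
    // false = descending sort key order; this is newest first
    // only if the sort key values sort chronologically.
    ScanIndexForward: false,
    Limit: limit
  };
}
```

You would pass the result to dynamodb.query(params, callback) in the same way as the other examples in this answer.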
Most of your queries involve finding 'all books' and ordering them in some way. These lists do not sound time sensitive. For example when a user asks for the most popular 100 books do they need to know the most popular books, including every vote counted up until the last second? I doubt it. Additionally are these lists specific to individual users? It doesn't sound like it.
My general tip is this: store your raw data in DynamoDB, and update it in real time. Create your common lists of books, update them once in a while (perhaps daily), and store these lists in a cache. Optionally you could store these lists in DynamoDB in separate tables and query them in the event your cache is destroyed.
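As a rough sketch of the caching idea, something like the following would do for a single application instance. The names createListCache and loadFn are hypothetical; loadFn stands for whatever function builds the list, for example one that wraps a scan. The optional now argument is only there to make the cache easy to test.

```javascript
// Minimal in-memory cache that reloads a list once it is older than ttlMs.
// loadFn(callback) is a hypothetical loader that calls back with (err, list).
function createListCache(loadFn, ttlMs, now) {
  now = now || Date.now;
  var cached = null;
  var loadedAt = 0;
  return function get(callback) {
    if (cached !== null && now() - loadedAt < ttlMs) {
      return callback(null, cached); // still fresh, no database call
    }
    loadFn(function (err, list) {
      if (err) return callback(err);
      cached = list;
      loadedAt = now();
      callback(null, cached);
    });
  };
}
```

With a daily refresh you would create it as createListCache(loadFeaturedBooks, 24 * 60 * 60 * 1000), where loadFeaturedBooks is your own loader.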
Get all books currently featured in book stores
var params = {
TableName: "Books",
ExpressionAttributeValues: {
":a": {
BOOL: true
}
},
FilterExpression: "currently_featured_in_book_stores = :a"
};
dynamodb.scan(params, function(err, data) {
if (err) console.log(err, err.stack); // an error occurred
else console.log(data); // successful response
});
This operation will retrieve all books that are currently featured in book stores. It uses a scan. If you are not already familiar with scan, query and getItem you should definitely spend some time reading about them.
A scan evaluates every item in a table. For this reason scans sometimes don't scale well on large tables and can be expensive if you are only retrieving a few items. A query uses the partition key to return a set of items and is therefore typically fast and efficient. You can use a sort key in a query to quickly return a range of items from within a partition. GetItem uses the unique primary key and is very efficient.
As a simplified illustration, assume each item is about 4 KB and reads are strongly consistent, so that reading one item costs roughly 1 RCU. If your table had 100 items, ANY scan you perform would cost about 100 RCUs, however few items the filter returns, because the filter is applied after the items are read. If you perform a query, and only 2 items are in the queried partition, it would cost you about 2 RCUs.
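One practical detail with scans: a single Scan call returns at most 1 MB of data, so on a larger table you have to keep calling it with the LastEvaluatedKey from the previous response. Here is a minimal sketch of that loop. scanAll is a name I made up, and client is assumed to expose scan(params, callback) like the dynamodb object in the examples in this answer.

```javascript
// Collect every page of a scan by following LastEvaluatedKey.
// `client` is assumed to expose scan(params, callback).
function scanAll(client, params, done, items) {
  items = items || [];
  client.scan(params, function (err, data) {
    if (err) return done(err);
    items = items.concat(data.Items || []);
    if (!data.LastEvaluatedKey) return done(null, items); // last page
    // More data remains: resume from where the previous page stopped.
    var next = Object.assign({}, params, {
      ExclusiveStartKey: data.LastEvaluatedKey
    });
    scanAll(client, next, done, items);
  });
}
```

You would call it as scanAll(dynamodb, params, function(err, items) { ... }) with any of the scan parameters shown here.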
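To make the arithmetic concrete, here is a rough estimator. It assumes the standard sizing rule that 1 RCU covers one strongly consistent read of up to 4 KB, that eventually consistent reads cost half as much, and that a scan or query is charged on the total size of the data it reads rather than per item. The function name is my own.

```javascript
// Rough read capacity estimate for one Scan or Query.
// totalBytes: total size of the items read (for a scan, the whole table).
function estimateRcus(totalBytes, stronglyConsistent) {
  var units = Math.ceil(totalBytes / 4096); // round up to 4 KB blocks
  return stronglyConsistent ? units : units / 2;
}
```

So 100 items of 4 KB each cost about 100 RCUs to scan with strong consistency, while 100 items of 400 bytes each would come to about 10.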
If a significant proportion of items in the Books table have currently_featured_in_book_stores=true, I would do a scan. If only a small number of items in the table have currently_featured_in_book_stores=true AND this is a very frequent query, you could consider creating a GSI on the Books table with partition key of currently_featured_in_book_stores and sort key of ISBN.
Imagine your books table has 100 books, and 50 have currently_featured_in_book_stores=true. Doing a scan costs 100 RCUs and won't cost much more than a query. Now imagine only one book has currently_featured_in_book_stores=true: performing a scan would cost 100 RCUs but a query would only cost 1 RCU. However, you should think hard before adding GSIs: they do not share throughput with the base table, and you have to purchase RCUs separately for your GSI. If you under-provision a GSI it can end up being slower than a scan on a well-provisioned base table.
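A boolean value is a bad partition key, because it can only take two values and so spreads the items very unevenly, and I would go for a scan here. Note also that DynamoDB key attributes must be of type String, Number or Binary, so to index this flag you would have to store it as, say, the string "true" rather than as a BOOL. With that change, and the GSI above, your query would look like this:
var params = {
TableName: "Books",
IndexName: "Index_Books_In_Stores",
ExpressionAttributeValues: {
":v1": {
S: "true" // assumes the indexed flag is stored as a string; BOOL is not a valid key type
}
},
KeyConditionExpression: "currently_featured_in_book_stores = :v1"
};
dynamodb.query(params, function(err, data) {
if (err) console.log(err, err.stack); // an error occurred
else console.log(data); // successful response
});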
Get all books that are upcoming
All of the above still applies. I would do a scan like this:
var params = {
TableName: "Books",
ExpressionAttributeValues: {
":a": {
BOOL: true
}
},
FilterExpression: "upcoming_release = :a"
};
dynamodb.scan(params, function(err, data) {
if (err) console.log(err, err.stack); // an error occurred
else console.log(data); // successful response
});
I would do this scan infrequently and cache the results in a temporary store (i.e. in application memory).
Get all books sorted by most thumbs
The important thing here is the 'Get all books...'. That tells you right away that a scan is probably going to be the best approach. You can think of a query as a scan that only looks at one partition. You don't want to look at a partition of books, you want ALL the books, so a scan is the way to go.
The only way DynamoDB will return sorted items is if you perform a query on a table or index that has a sort key. In this case the items would automatically be returned in sorted order based on the sort key. So for this search, you just need to do a scan to get all the books, and then sort them by your chosen attribute (thumbs) client side. The scan simply returns all books and looks like this.
var params = {
TableName: "Books"
};
dynamodb.scan(params, function(err, data) {
if (err) console.log(err, err.stack); // an error occurred
else console.log(data); // successful response
});
Again, I would do this scan very infrequently and cache the top books. You can order your cache and just retrieve the number of items you need, perhaps the top 10, 100 or 1000. If the user carried on paging beyond the scope of the cache, you might need to do a new scan. I think more likely you would just limit the number of items and stop the user paging any further.
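The client side sort could be sketched like this. It assumes the items come back in the low-level attribute format used by the examples here, where a number arrives as a string under N, and it treats a missing count_thumbs as 0. The function names are illustrative.

```javascript
// Read count_thumbs from a low-level item; missing counts as 0.
function thumbs(item) {
  return item.count_thumbs ? Number(item.count_thumbs.N) : 0;
}

// Return the `limit` books with the most thumbs, highest first.
function topBooksByThumbs(items, limit) {
  return items
    .slice() // copy so the scan result is not mutated
    .sort(function (a, b) {
      return thumbs(b) - thumbs(a);
    })
    .slice(0, limit);
}
```

You would run this once on the scan result and put the returned array in your cache.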
Get all books that are in genre "Comedy"
Again, most likely I would do a scan infrequently and cache the list. You could consider adding a GSI with partition key genre and sort key ISBN. Personally I would start with the scan and cache approach and see how you get on. You can always add the GSI at a later date.
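For the scan itself, a filter with the contains function should work. This is a minimal sketch that assumes genre_ids is stored as a list (or number set) of numeric genre ids, as in the question's schema. The function name is my own.

```javascript
// Build Scan parameters that keep only books tagged with one genre id.
// Assumption: genre_ids is a list or number set of numeric ids.
function booksInGenreParams(genreId) {
  return {
    TableName: "Books",
    ExpressionAttributeValues: {
      ":g": { N: String(genreId) } // numbers are sent as strings
    },
    FilterExpression: "contains(genre_ids, :g)"
  };
}
```

Pass the result to dynamodb.scan(params, callback) as in the earlier examples. Bear in mind the filter does not reduce the read cost, since every book is still read.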
Query for book(s) named "Harry Potter"
Clearly you can't cache this one. Do a scan with a FilterExpression on title. Note that the match is case sensitive:
var params = {
TableName: "Books",
ExpressionAttributeValues: {
":a": {
S: "Harry Potter"
}
},
FilterExpression: "contains(title, :a)"
};
dynamodb.scan(params, function(err, data) {
if (err) console.log(err, err.stack); // an error occurred
else console.log(data); // successful response
});
You can check out the condition operators here
Get all books with ISBN 1, 2, 3, 4, or 9
For this one, do a GetItem on each individual ISBN and add it into a set. The call below gets one book. You would put this in a loop and iterate through the set of ISBNs you want to get.
var params = {
Key: {
"ISBN": {
S: "1"
}
},
TableName: "Books"
};
dynamodb.getItem(params, function(err, data) {
if (err) console.log(err, err.stack); // an error occurred
else console.log(data); // successful response
});
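As an alternative to the loop, BatchGetItem can fetch several books in one request. It accepts up to 100 keys per call, so a longer list has to be split. Here is a minimal sketch that builds the request parameters, assuming the same string ISBN key as above. The function name is hypothetical.

```javascript
// Build one BatchGetItem request per chunk of up to 100 ISBNs.
function batchGetBookParams(isbns) {
  var requests = [];
  for (var i = 0; i < isbns.length; i += 100) {
    var keys = isbns.slice(i, i + 100).map(function (isbn) {
      return { ISBN: { S: String(isbn) } };
    });
    requests.push({ RequestItems: { Books: { Keys: keys } } });
  }
  return requests;
}
```

Each element would be passed to dynamodb.batchGetItem(params, callback). The response can contain UnprocessedKeys, which you would need to retry.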
Question 1
Yes, if you store the genre as a string against each item, and you change the genre name, you would have to update each item. Or as an alternative you would have to update the genre on the item before presenting it to the user.
If you expect to change genre names, the idea of using genre_id mappings seems like a good one. Just have a table of genre names and ids, load it when your application starts and keep it in application memory. You might need an admin function to reload the genre mappings table.
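The in-memory mapping could look something like this. It is only a sketch: the attribute names genre_id and name on the genre table are my assumption, and the items are taken to be in the low-level format a scan of that table would return.

```javascript
// Build a lookup from the items of a (hypothetical) genres table.
// Assumed attributes: genre_id (number) and name (string).
function buildGenreLookup(items) {
  var byId = {};
  items.forEach(function (item) {
    byId[item.genre_id.N] = item.name.S;
  });
  // Translate a book's genre ids into display names.
  return function genreNames(ids) {
    return ids.map(function (id) {
      return byId[String(id)] || "Unknown";
    });
  };
}
```

Renaming "Drama" to "Dramatic" then means updating one row in the genre table and rebuilding the lookup, with no change to any book.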
Keeping application parameters in a database is a well used design.
Question 3
Absolutely, have a User table in DynamoDB. That's the way I do it in my application which uses Cognito. I store a minimum set of fields in Cognito relating to user registration, then I have lots of application specific data in DynamoDB in a user table.
Question 4
Regarding GraphQL schemas, I would check out this article by AWS. Not too sure if that's of help.