 

Indexing notifications table in DynamoDB

I am going to implement a notification system, and I am trying to figure out a good way to store notifications within a database. I have a web application that uses a PostgreSQL database, but a relational database does not seem ideal for this use case; I want to support various types of notifications, each including different data, though a subset of the data is common for all types of notifications. Therefore I was thinking that a NoSQL database is probably better than trying to normalize a schema in a relational database, as this would be quite tricky.

My application is hosted in Amazon Web Services (AWS), and I have been looking a bit at DynamoDB for storing the notifications. This is because it is managed, so I do not have to deal with the operations of it. Ideally, I'd like to have used MongoDB, but I'd really prefer not having to deal with the operations of the database myself. I have been trying to come up with a way to do what I want in DynamoDB, but I have been struggling, and therefore I have a few questions.

Suppose that I want to store the following data for each notification:

  • An ID
  • User ID of the receiver of the notification
  • Notification type
  • Timestamp
  • Whether or not it has been read/seen
  • Meta data about the notification/event (no querying necessary for this)

Now, I would like to be able to query for the most recent X notifications for a given user. Also, in another query, I'd like to fetch the number of unread notifications for a particular user. I am trying to figure out a way that I can index my table to be able to do this efficiently.

I can rule out simply having a hash primary key, as I would not be doing lookups by simply a hash key. I don't know if a "hash and range primary key" would help me here, as I don't know which attribute to put as the range key. Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key? Then perhaps a secondary index could help me to sort by the timestamp, if this is even possible.

I also looked at global secondary indexes, but the problem with these is that when querying the index, DynamoDB can only return attributes that are projected into the index. Since I would want all attributes to be returned, I would effectively have to duplicate all of my data, which seems rather ridiculous.

How can I index my notifications table to support my use case? Is it even possible, or do you have any other recommendations?

ba0708 asked Dec 25 '22 20:12


1 Answer

Motivation note: When using a cloud storage service like DynamoDB, you have to be aware of the storage model, because it directly affects your performance, scalability, and financial costs. It is different from working with a local database, because you pay not only for the data that you store but also for the operations that you perform against that data. Deleting a record is a WRITE operation, for example, so if you don't have an efficient plan for cleanup (and your case, being Time Series Data, especially needs one), you will pay the price. Your data model will not show problems with a small data volume, but it can definitely ruin your plans when you need to scale. That being said, decisions like creating (or not creating) an index, defining proper attributes for your keys, and segmenting your tables will make all the difference down the road.

Choosing DynamoDB (or, more generically, a key-value store) comes with trade-offs like any other architectural decision. You need to clearly understand certain concepts about the storage model to use the tool efficiently; choosing the right keys is important, but it is only the tip of the iceberg. For example, if you overlook the fact that you are dealing with Time Series Data, then no matter what primary keys or indexes you define, your provisioned throughput will not be optimized, because it is spread across your entire table (and its partitions) and NOT ONLY THE DATA THAT IS FREQUENTLY ACCESSED. Unused data directly impacts your throughput just because it is part of the same table. This leads to cases where the ProvisionedThroughputExceededException is thrown "unexpectedly": you know for sure that your provisioned throughput should be enough for your demand, but the TABLE PARTITION that is being unevenly accessed has reached its limits (more details here).

The post below has more details, but I wanted to give you some motivation to read through it and understand that, although you can certainly find an easier solution for now, it might mean starting from scratch in the near future when you hit a wall (the "wall" might come as high financial costs, limits on performance and scalability, or a combination of all of these).

Q: Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key?

A: No. The Query operation requires the Hash Key, so you cannot look up items by the Range Key alone. DynamoDB is a key-value store, meaning that the most efficient queries use the entire key (Hash or Hash-Range). Falling back to the Scan operation just because you don't have your key is a sign that your data model does not fit your requirements. There are a few things to consider and many options to avoid this problem (more details below).

Before moving on, I suggest reading this quick post to clearly understand the difference between a Hash Key and a Hash+Range Key:

DynamoDB: When to use what PK type?

Your case is a typical Time Series Data scenario, where your records become obsolete as time goes by. There are two main factors you need to be careful about:

  • Make sure your tables have even access patterns

If you put all your notifications in a single table and the most recent ones are accessed more frequently, your provisioned throughput will not be used efficiently. You should group the most accessed items in a single table so the provisioned throughput can be properly adjusted for the required access. Additionally, make sure you properly define a Hash Key that will allow even distribution of your data across multiple partitions.

  • Make sure obsolete data is deleted in the most efficient way (in terms of effort, performance, and cost)

The documentation suggests segmenting the data into different tables so you can delete or back up an entire table once its records become obsolete (see more details below).

Here is the section from the documentation that explains best practices related to Time Series Data:

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns

For example, you could have your tables segmented by month:

Notifications_April, Notifications_May, etc
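To make the segmentation concrete, here is a minimal sketch of how an application might pick the monthly table for a given notification timestamp. The function name `table_name_for` is hypothetical, and the naming simply follows the Notifications_April pattern above; a real deployment that keeps data for more than a year would presumably also need the year in the name.

```python
from datetime import datetime, timezone

# English month names, spelled out so the result does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def table_name_for(moment: datetime) -> str:
    """Return the monthly table name for a notification timestamp.

    Hypothetical helper: follows the Notifications_<Month> pattern only.
    """
    return f"Notifications_{MONTHS[moment.month - 1]}"


print(table_name_for(datetime(2015, 4, 10, tzinfo=timezone.utc)))  # Notifications_April
```

Writes go to the table for the current month, and reads start there and fall back to the previous month's table when needed.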

Q: I would like to be able to query for the most recent X notifications for a given user.

A: I would suggest using the Query operation with only the Hash Key (UserId), and letting the Range Key sort the notifications by Timestamp (date and time).

Hash Key: UserId
Range Key: Timestamp
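As a rough sketch of what such a query could look like, the function below builds the parameters for DynamoDB's Query operation in the low-level request format. The function name, the example table name, and the choice of a string type for UserId are assumptions for illustration; `ScanIndexForward` and `Limit` are the standard Query parameters that give "newest first" and "at most X items".

```python
def recent_notifications_query(table_name: str, user_id: str, limit: int) -> dict:
    """Build Query parameters for the newest `limit` notifications of one user.

    Assumes a table keyed on UserId (hash, stored as a string) and
    Timestamp (range); both type choices are illustrative.
    """
    return {
        "TableName": table_name,
        # Only the hash key is constrained; the range key provides the ordering.
        "KeyConditionExpression": "UserId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        "ScanIndexForward": False,  # descending range key order: newest first
        "Limit": limit,
    }


params = recent_notifications_query("Notifications_April", "user-42", 10)
# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   items = boto3.client("dynamodb").query(**params)["Items"]
```

If the tables are segmented by month and the current table holds fewer than X notifications for the user, the application would have to run the same query against the previous month's table to fill the remainder.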

Note: A better solution would be for the Hash Key to contain not only the UserId but also another piece of information, concatenated to it, that you can calculate before querying, so that your Hash Key gives you even access patterns to your data. For example, you can start to have hot partitions if notifications from specific users are accessed more often than others; having additional information in the Hash Key would mitigate this risk.

Q: I'd like to fetch the number of unread notifications for a particular user.

A: Create a Global Secondary Index as a Sparse Index having the UserId as the Hash Key and Unread as the Range Key.

Example:

Index Name: Notifications_April_Unread
Hash Key: UserId
Range Key: Unread

When you query this index by Hash Key (UserId) you would automatically have all unread notifications with no unnecessary scans through notifications which are not relevant to this case. Keep in mind that the original Primary Key from the table is automatically projected into the index, so in case you need to get more information about the notification you can always resort to those attributes to perform a GetItem or BatchGetItem on the original table.
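A minimal sketch of the count itself, assuming the index described above: Query accepts `Select="COUNT"`, which returns only the number of matching items, and the results may still be paginated, so the counts of all pages are summed. The function `count_unread` and its `query` argument are hypothetical; in real use `query` would be something like `boto3.client("dynamodb").query`.

```python
from typing import Any, Callable, Dict


def count_unread(
    query: Callable[..., Dict[str, Any]],
    table_name: str,
    index_name: str,
    user_id: str,
) -> int:
    """Count a user's unread notifications by querying the sparse index.

    `query` is any callable with the keyword interface of DynamoDB's Query
    operation; UserId is assumed to be stored as a string.
    """
    params: Dict[str, Any] = {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "UserId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        "Select": "COUNT",  # return only the count, not the items
    }
    total = 0
    while True:
        page = query(**params)
        total += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return total
        # More matching entries remain: continue from where this page stopped.
        params["ExclusiveStartKey"] = last_key


# Possible usage, assuming boto3 and the index name from the example above:
#   import boto3
#   client = boto3.client("dynamodb")
#   n = count_unread(client.query, "Notifications_April",
#                    "Notifications_April_Unread", "user-42")
```

Because only unread notifications carry the Unread attribute, the index stays small and the count reads far fewer entries than the table holds.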

Note: You can explore using an attribute other than the 'Unread' flag; the important thing to keep in mind is that a Sparse Index can help you with this use case (more details below).

Detailed Explanation:

I would have a sparse index to make sure that you can query a reduced dataset to do the count. In your case you can have an attribute "unread" to flag if the notification was read or not, and use that attribute to create the Sparse Index. When the user reads the notification you simply remove that attribute from the notification so it doesn't show up in the index anymore. Here are some guidelines from the documentation that clearly apply to your scenario:

Take Advantage of Sparse Indexes

For any item in a table, DynamoDB will only write a corresponding index entry if the index range key attribute value is present in the item. If the range key attribute does not appear in every table item, the index is said to be sparse. [...]

To track open orders, you can create an index on CustomerId (hash) and IsOpen (range). Only those orders in the table with IsOpen defined will appear in the index. Your application can then quickly and efficiently find the orders that are still open by querying the index. If you had thousands of orders, for example, but only a small number that are open, the application can query the index and return the OrderId of each open order. Your application will perform significantly fewer reads than it would take to scan the entire CustomerOrders table. [...]

Instead of writing an arbitrary value into the IsOpen attribute, you can use a different attribute that will result in a useful sort order in the index. To do this, you can create an OrderOpenDate attribute and set it to the date on which the order was placed (and still delete the attribute once the order is fulfilled), and create the OpenOrders index with the schema CustomerId (hash) and OrderOpenDate (range). This way when you query your index, the items will be returned in a more useful sort order.[...]

Such a query can be very efficient, because the number of items in the index will be significantly fewer than the number of items in the table. In addition, the fewer table attributes you project into the index, the fewer read capacity units you will consume from the index.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForGSI.html#GuidelinesForGSI.SparseIndexes
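Applied to notifications, the sparse-index pattern comes down to two writes: store the Unread attribute when a notification is created, and remove it when the user reads the notification. The sketch below builds the item and the UpdateItem parameters in the low-level request format. The function names, the NotificationType attribute, the "1" flag value, and the use of epoch milliseconds for Timestamp are all assumptions for illustration.

```python
def new_notification_item(user_id: str, timestamp_ms: int, notification_type: str) -> dict:
    """Build a notification item that starts out unread.

    The presence of the Unread attribute is what makes the item appear in a
    sparse index keyed on it; the value "1" is arbitrary.
    """
    return {
        "UserId": {"S": user_id},
        "Timestamp": {"N": str(timestamp_ms)},  # numbers are sent as strings
        "NotificationType": {"S": notification_type},
        "Unread": {"S": "1"},
    }


def mark_read_update(table_name: str, user_id: str, timestamp_ms: int) -> dict:
    """Build UpdateItem parameters that remove the Unread attribute."""
    return {
        "TableName": table_name,
        "Key": {
            "UserId": {"S": user_id},
            "Timestamp": {"N": str(timestamp_ms)},
        },
        # A name placeholder keeps the expression safe from reserved words.
        "UpdateExpression": "REMOVE #unread",
        "ExpressionAttributeNames": {"#unread": "Unread"},
    }


item = new_notification_item("user-42", 1428660000000, "comment")
update = mark_read_update("Notifications_April", "user-42", 1428660000000)
# With boto3 these could be sent as:
#   client.put_item(TableName="Notifications_April", Item=item)
#   client.update_item(**update)
```

Once the attribute is removed, the item no longer has an entry in the index, so it stops contributing to the unread count.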

Find below some references to the operations that you will need to programmatically create and delete tables:

Create Table http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_CreateTable.html

Delete Table http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteTable.html
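For orientation, here is a sketch of the request parameters a monthly table might be created with, combining the UserId/Timestamp primary key and the sparse Unread index discussed above. The function name, the attribute types, the KEYS_ONLY projection, and the capacity numbers are assumptions; the field names follow the CreateTable API linked above.

```python
def create_table_params(table_name: str, read_units: int, write_units: int) -> dict:
    """Build CreateTable parameters for one monthly notifications table."""
    throughput = {"ReadCapacityUnits": read_units, "WriteCapacityUnits": write_units}
    return {
        "TableName": table_name,
        # Only attributes used in a key schema are declared here.
        "AttributeDefinitions": [
            {"AttributeName": "UserId", "AttributeType": "S"},
            {"AttributeName": "Timestamp", "AttributeType": "N"},
            {"AttributeName": "Unread", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "UserId", "KeyType": "HASH"},
            {"AttributeName": "Timestamp", "KeyType": "RANGE"},
        ],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": f"{table_name}_Unread",
                "KeySchema": [
                    {"AttributeName": "UserId", "KeyType": "HASH"},
                    {"AttributeName": "Unread", "KeyType": "RANGE"},
                ],
                # Keys only: enough for counting and for follow-up GetItem calls.
                "Projection": {"ProjectionType": "KEYS_ONLY"},
                "ProvisionedThroughput": dict(throughput),
            }
        ],
        "ProvisionedThroughput": throughput,
    }


hot = create_table_params("Notifications_May", read_units=50, write_units=50)
# With boto3 these could be sent as:
#   client.create_table(**hot)
#   client.delete_table(TableName="Notifications_April")  # drop an obsolete month
```

The table for the current month would get the higher throughput, and older tables could be dialed down or deleted outright once their notifications are obsolete.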

b-s-d answered Jan 05 '23 16:01