Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to model one-to-one, one-to-many and many-to-many relationships in DynamoDB

What is the best way to model these relationships in DynamoDB?

  • One-to-one relationships
  • One-to-many relationships
  • Many-to-many relationships
like image 999
F_SO_K Avatar asked Mar 13 '19 22:03

F_SO_K


People also ask

What is the data model of DynamoDB?

A DynamoDB table design corresponds to the relational order entry schema that is shown in Relational modeling. It follows the Adjacency list design pattern, which is a common way to represent relational data structures in DynamoDB.

Can DynamoDB be used for relational data?

DynamoDB Relational Modeling Process. Data is stored in a normalized relational structure in traditional relational database management systems (RDBMS). Hierarchical data structures are reduced to a set of common elements that are stored across several tables using this structure.

Can DynamoDB do joins?

Therefore, DynamoDB does not allow "join" queries. However, it is possible to perform joins on DynamoDB tables via external services such as Apache Hive and Amazon EMR. But it's important to note that DynamoDB does not natively support joins.

Can we have multiple GSI in DynamoDB?

You can create 20 global secondary indexes for a DynamoDB table as of the time this page was written. Sometimes, though, your application might need to support multiple access patterns and exceed the current limit of global secondary indexes per table.


1 Answers

I've seen variations of this question so many times I thought I would write a Q&A.

Key DynamoDB Basics

Before reading this you should understand:

  • Every DynamoDB table has a unique primary key
  • A primary key must be composed of a partition key, and can have optionally have a sort key. A primary key with both a partition key and a sort key is a composite key.
  • A GetItem request returns one and only one item using its unique primary key.
  • A Query does a fast lookup and must specify one and only one partition key. It can return multiple items.
  • A Scan evaluates every item in a table and may return a subset based on filter parameters. Scans are the correct choice in some circumstances, but can be slow and costly if used incorrectly.
  • A Global Secondary Index (GSI) has a different partition key to the base table. Think of it a bit like having two tables (the base table and GSI) that are kept in sync. Depending on use, a GSI may double the cost of your base table.
  • A Local Secondary Index (LSI) has the same partition key as the base table but a different sort key. Think of it as a different way to sort your base table data, but only within the partition keys. An LSI does not cost you anything.

One-to-one


We can model passports and people to demonstrate this relationship. One passport can only have one owner, and one person can only have one passport.

The approach is very simple. We have two tables, and one of those tables should have a foreign key.

Passport table:

Partition Key: PassportId

╔════════════╦═══════╦════════════╗ ║ PassportId ║ Pages ║   Issued   ║ ╠════════════╬═══════╬════════════╣ ║ P1         ║    15 ║ 11/03/2009 ║ ║ P2         ║    18 ║ 09/02/2018 ║ ╚════════════╩═══════╩════════════╝ 

Passport holder table:

Partition Key: PersonId

╔══════════╦════════════╦══════╗ ║ PersonId ║ PassportId ║ Name ║ ╠══════════╬════════════╬══════╣ ║ 123      ║ P1         ║ Jane ║ ║ 234      ║ P2         ║ Paul ║ ╚══════════╩════════════╩══════╝ 

Notice that PersonId did not appear in the passport table. If we did that, we would have two places with the same information (which passports belong to which person). This would lead to extra data updates and potentially some data quality issues if the tables did not agree on who owns which passport.

However, we are missing a use case. We can easily look a person up by their PersonId, and find which passport they have. But what if we have a PassportId and we need to find who owns it? In the current model we would need to perform a Scan on the Passport holder table. If this is a regular use case, we wouldn't want to use a Scan. To support a GetItem we can simply add a GSI to the Passport holder table:

Passport holder table GSI:

Partition Key: PassportId

╔════════════╦══════════╦══════╗ ║ PassportId ║ PersonId ║ Name ║ ╠════════════╬══════════╬══════╣ ║ P1         ║ 123      ║ Jane ║ ║ P2         ║ 234      ║ Paul ║ ╚════════════╩══════════╩══════╝ 

Now we can look up relationships using PassportId or PersonId very quickly and cheaply.

There are other options for modelling this. For example you could have a 'plain' Passport table and Person table with no foreign keys, then have a third auxiliary table that simple maps PassortIds and PersonIds together. I don't think that's cleanest design in this case, but if you prefer it, there is nothing wrong with that approach. Note that their is an example of an auxiliary relationship table in the many-to-many relationship section.


One-to-many


We can model pets and owners to demonstrate this relationship. Pets can only have one owner, but owners can have many pets.

The model looks very similar to the one-to-one model, so I will just focus on this differences.

Pet table:

Partition Key: PetId

╔═══════╦═════════╦════════╗ ║ PetId ║ OwnerId ║ Type   ║ ╠═══════╬═════════╬════════╣ ║ P1    ║ O1      ║ Dog    ║ ║ P2    ║ O1      ║ Cat    ║ ║ P3    ║ O2      ║ Rabbit ║ ╚═══════╩═════════╩════════╝ 

Owner table:

Partition Key: OwnerId

╔═════════╦════════╗ ║ OwnerId ║ Name   ║ ╠═════════╬════════╣ ║ O1      ║ Angela ║ ║ O2      ║ David  ║ ╚═════════╩════════╝ 

We put the foreign key in the many table. If we did it the other way around, and put PetIds in the Owner table, one Owner Item would have to have a set of PetIds, and that would get complicated to manage.

If we want to find out the Owner for a Pet, its very easy. We can do a GetItem to return the Pet Item, and it tells us who the owner is. But the other way around is harder - if we have an OwnerId, which Pets do they own? To save us have to do a Scan on the Pet table, we instead add a GSI to the Pet table.

Pet table GSI

Partition Key: OwnerId

╔═════════╦═══════╦════════╗ ║ OwnerId ║ PetId ║ Type   ║ ╠═════════╬═══════╬════════╣ ║ O1      ║ P1    ║ Dog    ║ ║ O1      ║ P2    ║ Cat    ║ ║ O2      ║ P3    ║ Rabbit ║ ╚═════════╩═══════╩════════╝ 

If we have an OwnerId and we want to find their Pets, we can perform a Query on the Pet table GSI. For example a Query on Owner O1 will return the items with PetId P1 and P2.

You may notice something interesting here. A primary key must be unique for a table. This is true only for the base table. A GSI primary key, in this case just the GSI partition key, does not have to be unique.

In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique

On a side note, a GSI does not need to project all of the same attributes as the base table. If you are using the GSI for lookups only, you may wish to only project the GSI key attributes.


Many-to-many


There are three main ways to model a many-to-many relationship in DynamoDB. Each have strengths and weaknesses.

We can use the example of Doctors and Patients to model this relationship. A Doctor can have many patients and a patient can have many Doctors.


Many-to-many - Option 1 - Auxiliary table


Generally, this is my preferred approach, which is why it goes first. The idea is to create 'plain' base tables with no relationship references. The relationship references then go in auxiliary tables (one auxiliary table per relationship type - in this case just Doctors-Patients).

Doctor table:

Partition Key: DoctorId

╔══════════╦═══════╗ ║ DoctorId ║ Name  ║ ╠══════════╬═══════╣ ║ D1       ║ Anita ║ ║ D2       ║ Mary  ║ ║ D3       ║ Paul  ║ ╚══════════╩═══════╝ 

Patient table

Partition Key: PatientId

╔═══════════╦═════════╦════════════╗ ║ PatientId ║ Name    ║ Illness    ║ ╠═══════════╬═════════╬════════════╣ ║ P1        ║ Barry   ║ Headache   ║ ║ P2        ║ Cathryn ║ Itchy eyes ║ ║ P3        ║ Zoe     ║ Munchausen ║ ╚═══════════╩═════════╩════════════╝ 

DoctorPatient table (auxiliary table)

Partition Key: DoctorId

Sort Key: PatientId

╔══════════╦═══════════╦══════════════╗ ║ DoctorId ║ PatientId ║ Last Meeting ║ ╠══════════╬═══════════╬══════════════╣ ║ D1       ║ P1        ║ 01/01/2018   ║ ║ D1       ║ P2        ║ 02/01/2018   ║ ║ D2       ║ P2        ║ 03/01/2018   ║ ║ D2       ║ P3        ║ 04/01/2018   ║ ║ D3       ║ P3        ║ 05/01/2018   ║ ╚══════════╩═══════════╩══════════════╝ 

DoctorPatient table GSI

Partition Key: PatientId

Sort Key: DoctorId

╔═══════════╦══════════╦══════════════╗ ║ PatientId ║ DoctorId ║ Last Meeting ║ ╠═══════════╬══════════╬══════════════╣ ║ P1        ║ D1       ║ 01/01/2018   ║ ║ P2        ║ D1       ║ 02/01/2018   ║ ║ P2        ║ D2       ║ 03/01/2018   ║ ║ P3        ║ D2       ║ 04/01/2018   ║ ║ P3        ║ D3       ║ 05/01/2018   ║ ╚═══════════╩══════════╩══════════════╝ 

There are three tables, the DoctorPatient auxiliary table is the interesting one.

The DoctorPatient base table primary key must be unique, so we create a composite key of the DoctorId (partition key) and PatientId (sort key).

We can perform a Query on the DoctorPatient base table using DoctorId to get all patients that a Doctor has.

We can perform a Query on the DoctorPatient GSI using PatientId to get all of the Doctors associated with a Patient.

The strengths of this approach are a clean separation of tables, and the ability to map simple business objects directly to the database. It requires no use of more advanced features such as sets.

It is necessary to co-ordinate some updates, for example if you delete a Patient, you also need to be careful to delete the relationships in the DoctorPatient table. However the chance of introducing data quality issues is low compared to some other approaches.

EDIT: DynamoDB now supports Transactions, allowing you to co-ordinate multiple updates into a single atomic transaction across multiple tables.

A potential weakness of this approach is that it requires 3 tables. If you are provisioning tables with throughput, the more tables there are, the thinner you have to spread your capacity. However with the new on-demand feature, this is not a concern.


Many-to-many - Option 2 - Foreign key sets


This approach uses just two tables.

Doctor table:

Partition Key: DoctorId

╔══════════╦════════════╦═══════╗ ║ DoctorId ║ PatientIds ║ Name  ║ ╠══════════╬════════════╬═══════╣ ║ D1       ║ P1,P2      ║ Anita ║ ║ D2       ║ P2,P3      ║ Mary  ║ ║ D3       ║ P3         ║ Paul  ║ ╚══════════╩════════════╩═══════╝ 

Patient table:

Partition Key: PatientId

╔═══════════╦══════════╦═════════╗ ║ PatientId ║ DoctorIds║  Name   ║ ╠═══════════╬══════════╬═════════╣ ║ P1        ║ D1       ║ Barry   ║ ║ P2        ║ D1,D2    ║ Cathryn ║ ║ P3        ║ D2,D3    ║ Zoe     ║ ╚═══════════╩══════════╩═════════╝ 

This approach involves storing relationships as a set in each table.

To find the Patients for a Doctor, we can use GetItem on the Doctor table to retrieve the Doctor item. Then the PatientIds are stored as a set in a Doctor attribute.

To find the Doctors for a Patient, we can use GetItem on the Patient table to retrieve the Patient item. Then the DoctorIds are stored as a set in a Patient attribute.

The strength of this approach is that there is a direct mapping between business objects and database tables. There are only two tables so if you are using provision throughput capacity, it doesn't need to be spread too thinly.

The major downside to this approach is the potential for data quality issues. If you link a Patient to a Doctor, you have you co-ordinate two updates, one to each table. What happens if one update fails? You data can get out of sync.

Another downside is the use of Sets in both tables. The DynamoDB SDKs are designed to handle Sets, but certain operations can be complicated when Sets are involved.


Many-to-many - Option 3 - Graph Schema


AWS have previously referred to this as the Adjacency List pattern. It is more commonly referred to as a Graph database or a Triple Store.

I have previously answered this question on the AWS Adjancey List Pattern which seems to have helped some people understand it.

And there is a recent presentation by AWS that talks a lot about this pattern here

The approach involves putting all of the data in just one table.

I've just drawn some example rows rather than the whole table:

Partition Key: Key1

Sort Key: Key2

╔═════════╦═════════╦═══════╦═════════════╦══════════════╗ ║ Key1    ║ Key2    ║ Name  ║   illness   ║ Last Meeting ║ ╠═════════╬═════════╬═══════╬═════════════╬══════════════╣ ║ P1      ║ P1      ║ Barry ║ Headache    ║              ║ ║ D1      ║ D1      ║ Anita ║             ║              ║ ║ D1      ║ P1      ║       ║             ║ 01/01/2018   ║ ╚═════════╩═════════╩═══════╩═════════════╩══════════════╝ 

And then a GSI is required that inverts the keys:

Partition Key: Key2

Sort Key: Key1

╔═════════╦═════════╦═══════╦═════════════╦══════════════╗ ║ Key2    ║ Key1    ║ Name  ║   illness   ║ Last Meeting ║ ╠═════════╬═════════╬═══════╬═════════════╬══════════════╣ ║ P1      ║ P1      ║ Barry ║ Headache    ║              ║ ║ D1      ║ D1      ║ Anita ║             ║              ║ ║ P1      ║ D1      ║       ║             ║ 01/01/2018   ║ ╚═════════╩═════════╩═══════╩═════════════╩══════════════╝ 

This model has some strengths in some specific circumstances - it can perform well in highly connected data. If you format your data well, you can achieve extremely fast and scalable models. It is flexible in that you can store any entity or relationship in the table without updating your schema/tables. If you are provisioning throughput capacity it can be efficient as all of the throughput is available to any operation across the application.

This model suffers from some huge downsides if used incorrectly or without serious consideration.

You lose any direct mapping between your business objects and your tables. This almost always results in unreadable spaghetti code. Performing even simple queries can feel very complex. Managing data quality becomes difficult as there is no obvious mapping between the code and the database. Most projects I've seen that use this approach end up writing various utilities, some of which become products in their own right, just to manage the database.

Another minor problem is that every attribute for every item in your model has to exist on one table. This usually results in a table that has hundreds of columns. In itself this is not a problem, but trying to work on a table with that many columns usually throws up simple problems like difficulty in viewing the data.

In short I think AWS have probably released what should have been a useful article in a set of articles, but by failing to introduce other (simpler) concepts for managing many-to-many relationships, they have confused lots of people. So to be clear, the adjacency list pattern can be useful, but its not the only option for modelling many-to-many relationships in DynamoDB. By all means use it if it works for your circumstances such as seriously Big Data, but if not, try one of the simpler models.

like image 65
F_SO_K Avatar answered Oct 02 '22 09:10

F_SO_K