We're adopting MongoDB for a new solution and are currently trying to design the most effective data model for our needs as regards relationships between data items.
We've got to hold a three way relationship between users, items and lists. A user can have many items and many lists. A list will have one user and many items. An item can belong to many users and many lists. The latter is especially important - an item can belong to potentially huge numbers of lists: thousands, certainly and potentially tens or hundreds of thousands. Possibly even millions in the future. We need to be able to navigate these relationships in both directions: so, for example, getting all the items on a list or all the lists to which an item belongs. We also need the solution to be generic so that we can add many more types of document and relationships between them if we need to.
So it seems there are two possible solutions to this. The first is for each document in the database to hold its "relationships" as arrays of IDs. So a list document would have one relationships array holding the IDs of all its items and another holding the single ID of its user. In this model these arrays will become massive when an item belongs to many, many users or many, many lists.
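To make the first model concrete, here is a minimal sketch that uses plain Python dicts in place of MongoDB documents. The field names (`relationships`, `items`, `lists`, `users`) and the IDs are assumptions chosen for illustration, not anything MongoDB prescribes.

```python
# Plain dicts stand in for MongoDB documents; all field names are illustrative.
user = {"_id": "u1", "relationships": {"lists": ["l1"], "items": ["i1", "i2"]}}
lst = {"_id": "l1", "relationships": {"user": ["u1"], "items": ["i1", "i2"]}}
items = {
    "i1": {"_id": "i1", "relationships": {"users": ["u1"], "lists": ["l1"]}},
    "i2": {"_id": "i2", "relationships": {"users": ["u1"], "lists": ["l1"]}},
}


def related_ids(doc, kind):
    """Return the IDs stored on `doc` for one relationship kind."""
    return doc["relationships"].get(kind, [])


# Both directions are a single-document read, but every link is stored twice:
# once on the list and once on the item.
items_on_list = related_ids(lst, "items")
lists_for_item = related_ids(items["i1"], "lists")
```

The `lists` array on an item is the one that would grow to thousands or millions of entries.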
The second model requires a new type of document, a "relationship" document that stores the IDs of each partner and the relationship name. This is storing more data overall and so will impact disc space. It also looks like an "unnatural" way to approach this problem in NoSQL.
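As a rough sketch of the second model, again with plain Python dicts standing in for documents: the field names (`from`, `to`, `name`) and the relationship names are hypothetical, and the linear scans below stand in for what would be indexed queries in a real database.

```python
# Each relationship is its own small document; names are illustrative.
relationships = [
    {"from": "l1", "to": "i1", "name": "list_has_item"},
    {"from": "l1", "to": "i2", "name": "list_has_item"},
    {"from": "l2", "to": "i1", "name": "list_has_item"},
    {"from": "u1", "to": "l1", "name": "user_owns_list"},
]


def targets(rels, source_id, name):
    """Follow a relationship forwards, e.g. list -> items."""
    return [r["to"] for r in rels if r["from"] == source_id and r["name"] == name]


def sources(rels, target_id, name):
    """Follow the same relationship backwards, e.g. item -> lists."""
    return [r["from"] for r in rels if r["to"] == target_id and r["name"] == name]


items_on_l1 = targets(relationships, "l1", "list_has_item")
lists_with_i1 = sources(relationships, "i1", "list_has_item")
```

Each link is stored once and can be walked in either direction, and a new document type only needs a new relationship name.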
Performance-wise, space-wise, architecture-wise, which is better and why?
Cheers, Matt
It depends on your access patterns.
An embedded ID array is better for reading. With one quick read you get the IDs of all related objects and can then go and fetch them. But if your update rate is high, you'll have some trouble, as MongoDB will have to copy the same (already big) document over and over as it outgrows its allocated space on disk.
But this solution is really bad for writes. Imagine an item that belongs to a couple of million lists. You decide to delete it. Now you have to walk all those lists and pull this item's ID from their reference arrays. It's exciting, isn't it?
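The delete cost can be sketched in a few lines. This is an in-memory simulation with made-up names (`delete_item_embedded`, the `items` field), not MongoDB code; in MongoDB the equivalent would roughly be a multi-document update using `$pull`.

```python
def delete_item_embedded(lists, item_id):
    """Remove item_id from every list's embedded array.

    Returns how many list documents had to be modified.
    """
    touched = 0
    for doc in lists:
        if item_id in doc["items"]:
            doc["items"].remove(item_id)
            touched += 1
    return touched


# 1000 lists that all contain item "i1" plus one other item each.
lists = [{"_id": f"l{n}", "items": ["i1", f"other{n}"]} for n in range(1000)]
touched = delete_item_embedded(lists, "i1")
```

Deleting one item rewrites every list that references it, so the write cost grows with the number of lists the item belongs to.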
Storing references as separate documents is good for writes. Adding, editing and removing references is pretty fast. But this solution takes more disk space and, more importantly, precious RAM. Also reads are not as fast, especially if you have many references.
Given your numbers ("Possibly even millions in the future") I'd go with separate reference documents. You can always throw more hardware at it to accelerate queries. Scaling writes is traditionally the hardest part, and in this solution writes are fast and shardable.
I'd agree with Sergio regarding data access patterns being key here.
I'd also add another possible solution: storing a fourth document type with three properties, a reference to each of user, list, and item. That collection can be indexed for fast access on each of the three fields, given a unique compound index across all three to prevent duplicates, and allows for fast inserts and deletes.
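Here is a minimal in-memory sketch of that idea. The `TripleStore` class and its method names are invented for illustration; the set plays the role of the unique index across all three fields and the per-field dicts play the role of the single-field indexes.

```python
from collections import defaultdict

FIELDS = ("user", "list", "item")


class TripleStore:
    """In-memory stand-in for a collection of (user, list, item) documents."""

    def __init__(self):
        self.triples = set()  # acts like a unique index over all three fields
        self.by_field = {f: defaultdict(set) for f in FIELDS}  # one index per field

    def insert(self, user, lst, item):
        triple = (user, lst, item)
        if triple in self.triples:
            return False  # duplicate rejected, as a unique index would do
        self.triples.add(triple)
        for field, value in zip(FIELDS, triple):
            self.by_field[field][value].add(triple)
        return True

    def delete(self, user, lst, item):
        triple = (user, lst, item)
        if triple not in self.triples:
            return False
        self.triples.remove(triple)
        for field, value in zip(FIELDS, triple):
            self.by_field[field][value].discard(triple)
        return True

    def find(self, field, value):
        """Look up all triples by any one field, sorted for stable output."""
        return sorted(self.by_field[field].get(value, set()))


store = TripleStore()
store.insert("u1", "l1", "i1")
store.insert("u2", "l2", "i1")
duplicate_accepted = store.insert("u1", "l1", "i1")
lists_with_i1 = [lst for _, lst, _ in store.find("item", "i1")]
```

Inserting or deleting one link touches only that one small record, regardless of how many lists an item is on.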
Ultimately you are not storing much more data this way, because if you need to look up the relationship from both sides ("What items in what lists does this user have?" and "What users have this item in their lists?") you need to duplicate references anyway.
It feels relational, but sometimes that is the best solution.