Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mongo DB relations between documents in different collections

I'm not yet ready to let this go, which is why I re-thought the problem and edited the Q (original below).


I am using mongoDB for a weekend project and it requires some relations in the DB, which is what the misery is all about:

I have three collections:

Users
Lists
Texts

A user can have texts and lists - lists 'contain' texts. Texts can be in multiple lists.

I decided to go with separate collections (not embeds) because child documents don't always appear in context of their parent (eg. all texts, without being in a list).

So what needs to be done is reference the texts that belong into certain lists with exactly those lists. There can be unlimited lists and texts, though lists will be less in comparison.

In contrast to what I first thought of, I could also put the reference in every single text-document and not all text-ids in the list-documents. It would actually make a difference, because I could get away with one query to find every snippet in a list. Could even index that reference.

var TextSchema = new Schema({
      _id: Number,
      name: String,
      inListID: { type : Array , "default" : [] },
      [...]

It is also rather seldom the case that texts will be in MANY lists, so the array would not really explode. The question kind of remains though, is there a chance this scales or actually a better way of implementing it with mongoDB? Would it help to limit the amount of lists a text can be in (probably)? Is there a recipe for few:many relations?

It would even be awesome to get references to projects where this has been done and how it was implemented (few:many relations). I can't believe everybody shies away from mongo DB as soon as some relations are needed.



Original Question

I'll break it down in two problems I see so far: 1) Let's assume a list consists of 5 texts. How do I reference the texts contained in a list? Just open an array and store the text's _ids in there? Seems like those arrays might grow to the moon and back, slowing the app down? On the other hand texts need to be available without a list, so embedding is not really an option. What if I want to get all texts of a list that contains 100 texts.. sounds like two queries and an array with 100 fields :-/. So is this way of referencing the proper way to do it?

var ListSchema = new Schema({
  _id: Number,
  name: String,
  textids: { type : Array , "default" : [] },
  [...]

Problem 2) I see with this approach is cleaning the references if a text is deleted. Its reference will still be in every list that contained the text and I wouldn't want to iterate through all the lists to clean out those dead references. Or would I? Is there a smart way to solve this? Just making the texts hold the reference (in which list they are) just moves the problem around, so that's not an option.

I guess I'm not the first with this sort of problem but I was also unable to find a definitive answer on how to do it 'right'.

I'm also interested in general thoughts on best-practice for this sort of referencing (many-to-many?) and especially scalability/performance.

like image 932
BenMann Avatar asked May 26 '15 13:05

BenMann


People also ask

How does MongoDB manage relationship between documents?

In MongoDB, a relationship represents how different types of documents are logically related to each other. Relationships like one-to-one, one-to-many, etc., can be represented by using two different models: Embedded document model.

How does MongoDB relate two collections?

For performing MongoDB Join two collections, you must use the $lookup operator. It is defined as a stage that executes a left outer join with another collection and aids in filtering data from joined documents. For example, if a user requires all grades from all students, then the below query can be written: Students.

Can a MongoDB database have multiple collections?

Yes, you can have multiple collections within a database in MongoDB.

Does MongoDB support query join between collections?

6. Does MongoDB supports query joins between collections ? No MongoDB doesnot supports query joins between collections.


1 Answers

Relations are usually not a big problem, though certain operations involving relations might be. That depends largely on the problem you're trying to solve, and very strongly on the cardinality of the result set and the selectivity of the keys.

I have written a simple testbed that generates data following a typical long-tail distribution to play with. It turns out that MongoDB is usually better at relations than people believe.

After all, there are only three differences to relational databases:

  • Foreign key constraints: You have to manage these yourself, so there's some risk for dead links
  • Transaction isolation: Since there are no multi-document transactions, there's some likelihood for creating invalid foreign key constraints even if the code is correct (in the sense that it never tries to create a dead link), but merely interrupted at runtime. Also, it is hard to check for dead links because you could be observing a race condition
  • Joins: MongoDB doesn't support joins, though a manual subquery with $in does scale well up to several thousand items in the $in-clause, provided the reference values are indexed, of course

Iff you need to perform large joins, i.e. if your queries are truly relational and you need large amount of the data joined accordingly, MongoDB is probably not a good fit. However, many joins required in relational databases aren't truly relational, they are required because you had to split up your object to multiple tables, for instance because it contains a list.

An example of a 'truly' relational query could be "Find me all customers who bought products that got >4 star reviews by customers that ranked high in turnover in June". Unless you have a very specialized schema that essentially was built to support this query, you'll most likely need to find all the orders, group them by customer ids, take the top n results, use these to query ratings using $in and use another $in to find the actual customers. Still, if you can limit yourself to the top, say 10k customers of June, this is three round-trips and some fast $in queries.

That will probably be in the range of 10-30ms on typical cloud hardware as long as your queries are supported by indexes in RAM and the network isn't completely congested. In this example, things get messy if the data is too sparse, i.e. the top 10k users hardly wrote >4 star reviews, which would force you to write program logic that is smart enough to keep iterating the first step which is both complicated and slow, but if that is such an important scenario, there is probably a better suited data structure anyway.

like image 164
mnemosyn Avatar answered Sep 23 '22 07:09

mnemosyn