How to deal with referencing of separately stored objects in document DBs like Mongo?

Tags:

c#

mongodb

This problem is easily solved in ORMs like Entity Framework or NHibernate, but I do not see any ready-made solution in the C# driver for MongoDB. Let's say I have a collection of objects of type A that reference objects of type B, which I need to store in a separate collection, so that once a specific B changes, every A referencing it sees the change. In other words, I need this relation to be normalized. At the same time, I need A to reference B inside the class by a typed reference rather than by Id, as shown below:

public class A
{
   public B RefB { get; set; }
}

Do I have to handle all this referential consistency on my own? If so, which approach is best? Do I have to keep both B's Id and the B reference in the class and somehow keep their values in sync, like this:

public class A
{
    // Need to implement reference consistency as well
    public int RefBId { get; set; }

    private B _refB;
    [BsonIgnore]
    public B RefB
    {
        get { return _refB; }
        set { _refB = value; RefBId = _refB.Id; }
    }
}

I know somebody may say a relational database fits this case best, but I really have to use a document database like MongoDB. It solves many problems for my project, and in most cases I need to store objects denormalized; sometimes, though, we need a mixed design inside a single storage.

YMC asked Sep 26 '13 16:09

2 Answers

This is mostly an architectural concern, and it probably depends on personal taste a bit. I'll try to examine the pros and cons (actually only the cons, this is quite opinionated):

On the database level, MongoDB offers no tools to enforce referential integrity, so yes, you have to do this yourself. I suggest you use database objects that look like this:

public class DBObject 
{
    public ObjectId Id {get;set;}
}

public class Department : DBObject 
{
  // ...
}

public class EmployeeDB : DBObject
{
    public ObjectId DepartmentId {get;set;}
}

I suggest using plain DTOs like this at the database level no matter what. If you want additional sugar, put it in a separate layer, even if that means a bit of copying. Logic in the DB objects requires a very good understanding of how the driver hydrates the object and might force you to rely on implementation details.

Now, it's a matter of preference of whether you want to work with more 'intelligent' objects. Indeed, many people like to use strongly-typed auto-activating accessors, e.g.

public class Employee
{
    public Department Department
    { get { return /* the department object, magically, from the DB */; } }
}

This pattern comes with a number of challenges:

  • It requires the Employee class, a model class, to be able to hydrate the object from the database. That is tricky, because it needs to have the DB injected or you need a static object for database access which can also be tricky.
  • Accessing the Department looks completely cheap, but in fact, it triggers a database operation, it can be slow, it might fail. This is totally hidden from the caller.
  • In a 1:n relation, things grow a lot more complicated. For instance, would Department also expose a list of Employees? If so, would that really be a list (i.e. once you start reading the first, all employees must be deserialized?) Or is it a lazy MongoCursor?
  • To make matters worse, it is not usually clear what kind of caching should be used. Let's say you evaluate myDepartment.Employees[0].Department.Name. Obviously, this code isn't smart, but imagine a call stack with a few specialized methods. They might effectively execute just that, even if it's less visible. A naive implementation would deserialize the referenced Department again. That's ugly. On the other hand, caching aggressively is dangerous because you might actually want to re-fetch the object.
  • Worst of all: updates. So far, the challenges were largely read-only. Now let's say I call employeeJohn.Department.Name = "PixelPushers" and employeeJohn.Save(). Does that update the Department or not? If it does, are the changes to John serialized before or after the changes to dependent objects? What about versioning and locking?
  • Many semantics are hard to implement: employeeJohn.Department.Employees.Clear() can be tricky.
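
To make the hidden-cost and caching points concrete, here is a minimal sketch. It assumes a hypothetical in-memory CountingDepartmentStore in place of a real collection, and string ids in place of ObjectId, so that it runs without the driver; none of these names come from the MongoDB API.

```csharp
using System;
using System.Collections.Generic;

public class Department
{
    public string Id { get; set; }
    public string Name { get; set; }
}

// Hypothetical stand-in for a Department collection; it only counts lookups.
public class CountingDepartmentStore
{
    private readonly Dictionary<string, Department> _byId = new Dictionary<string, Department>();

    public int Loads { get; private set; }

    public void Insert(Department d) { _byId[d.Id] = d; }

    public Department Load(string id)
    {
        Loads++; // every call stands for one database round trip
        var stored = _byId[id];
        // Return a fresh copy, the way deserialization would.
        return new Department { Id = stored.Id, Name = stored.Name };
    }
}

public class LazyEmployee
{
    private readonly CountingDepartmentStore _store;

    public LazyEmployee(CountingDepartmentStore store, string departmentId)
    {
        _store = store;
        DepartmentId = departmentId;
    }

    public string DepartmentId { get; }

    // Naive auto-activating accessor: no caching, so each read hits the store.
    public Department Department => _store.Load(DepartmentId);
}

public static class LazyAccessorDemo
{
    public static int CountLoadsForTwoReads()
    {
        var store = new CountingDepartmentStore();
        store.Insert(new Department { Id = "d1", Name = "R&D" });
        var john = new LazyEmployee(store, "d1");

        var first = john.Department.Name;  // load #1
        var second = john.Department.Name; // load #2, same data fetched again
        return store.Loads;
    }
}
```

Two innocent-looking property reads cost two lookups here, and each read hands back a different Department instance, so an assignment like john.Department.Name = "PixelPushers" would change a throwaway copy. Adding a cache fixes the first problem and immediately raises the staleness question from the list above.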

Many ORMs use a set of complex patterns to allow these operations, so these problems aren't impossible to work around. But ORMs are typically in the range of 100k to well over 1M lines of code(!), and I doubt you have that kind of time. In an RDBMS, the need to activate related objects and use something like an ORM is much more severe, because you can't embed e.g. the list of line items in an invoice, so every 1:n or m:n relation must be represented using a join. That's called the object-relational impedance mismatch.

The idea of document databases, as I understand it, is that you don't need to break your model apart as unnaturally as you have to in an RDBMS. Still, there are the 'object borders'. If you think of your data model as a network of connected nodes, the challenge is to know which part of the data you are currently working on.

Personally, I prefer to not put an abstraction layer on top of this, because that abstraction is leaky, it hides what is really going on from the caller, and it tries to solve every problem with the same hammer.

Part of the idea of NoSQL is that your query patterns must be carefully matched to the data-model, because you can't simply apply the JOIN hammer to any table in sight.

So, my opinion is: stick to a thin layer and perform most of the database operations in a service layer. Move DTOs around instead of designing a complex domain model that breaks apart as soon as you need to add locking, MVCC, cascaded updates, etc.
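
A thin service layer over plain DTOs could look roughly like the sketch below. IStore, InMemoryStore and EmployeeService are hypothetical names made up for illustration, and the ids are plain strings rather than ObjectId so the example runs without the driver; a real IStore implementation would wrap a driver collection.

```csharp
using System;
using System.Collections.Generic;

public class DepartmentDto
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class EmployeeDto
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string DepartmentId { get; set; } // manual reference, stored by id only
}

// Hypothetical storage abstraction, not part of any driver.
public interface IStore<T>
{
    T FindById(string id); // returns null when nothing matches
    void Save(T item);
}

public class InMemoryStore<T> : IStore<T> where T : class
{
    private readonly Dictionary<string, T> _items = new Dictionary<string, T>();
    private readonly Func<T, string> _idOf;

    public InMemoryStore(Func<T, string> idOf) { _idOf = idOf; }

    public T FindById(string id)
    {
        T item;
        return _items.TryGetValue(id, out item) ? item : null;
    }

    public void Save(T item) { _items[_idOf(item)] = item; }
}

public class EmployeeService
{
    private readonly IStore<EmployeeDto> _employees;
    private readonly IStore<DepartmentDto> _departments;

    public EmployeeService(IStore<EmployeeDto> employees, IStore<DepartmentDto> departments)
    {
        _employees = employees;
        _departments = departments;
    }

    // Two explicit reads: the caller can see that the department costs a second lookup.
    public DepartmentDto GetDepartmentOf(string employeeId)
    {
        var employee = _employees.FindById(employeeId);
        if (employee == null) return null;
        return _departments.FindById(employee.DepartmentId);
    }

    // Referential integrity is checked by hand, because the database will not do it.
    public void AssignDepartment(string employeeId, string departmentId)
    {
        if (_departments.FindById(departmentId) == null)
            throw new InvalidOperationException("Unknown department: " + departmentId);

        var employee = _employees.FindById(employeeId);
        if (employee == null)
            throw new InvalidOperationException("Unknown employee: " + employeeId);

        employee.DepartmentId = departmentId;
        _employees.Save(employee);
    }
}
```

The DTOs stay dumb, every database access is a visible method call, and the one place that changes a reference is also the place that validates it. Note that the check in AssignDepartment is not atomic; a department deleted between the check and the save would still leave a dangling id.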

mnemosyn answered Sep 21 '22 00:09


In a document database, when you do something like your first example:

public class A
{
   public B RefB { get; set; }
}

You are fully embedding the value of B into the RefB property. In other words, your document looks like this:

[a/1]
{
    AProp: "foo",
    RefB: {
        BProp: "bar"
    }
}

It helps to look at things from a Domain Driven Design (DDD) perspective. This pattern of embedding normally occurs when B is either a "value object" or a "non-aggregate entity" (using DDD terminology).

It can also occur if you are storing a point-in-time snapshot of some other aggregate entity. In that scenario, you don't want to update the values of B if they change, or it would no longer represent that point in time.
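
As an illustration of the snapshot case, here is a small sketch using hypothetical Product, ProductSnapshot and OrderLine classes (none of them appear in the question). The order line embeds a copy of the product data taken when the order was placed, so later price changes do not rewrite history.

```csharp
using System;

public class Product
{
    public string Id { get; set; }
    public string Name { get; set; }
    public decimal Price { get; set; }
}

// Embedded value: the product data as it was when the order was placed.
public class ProductSnapshot
{
    public string ProductId { get; set; } // kept only to trace back to the source document
    public string Name { get; set; }
    public decimal Price { get; set; }

    public static ProductSnapshot Of(Product p)
    {
        return new ProductSnapshot { ProductId = p.Id, Name = p.Name, Price = p.Price };
    }
}

public class OrderLine
{
    public ProductSnapshot Snapshot { get; set; } // stored inside the order document
    public int Quantity { get; set; }

    // Computed from the embedded copy, never from the live product.
    public decimal Total
    {
        get { return Snapshot.Price * Quantity; }
    }
}
```

If a product priced at 10 is ordered twice and its price is later raised to 15, the line total computed from the snapshot stays at 20, which is exactly the point-in-time behaviour described above.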

The other pattern would be to treat A and B as separate aggregates. If one needs to refer to the other, you specify that with a reference to its ID only.

public class A
{
   public string BId { get; set; }
}

Your documents would then be stored such as:

[a/1]
{
    AProp: "foo",
    BId: "b/2"
}

[b/2]
{
    BProp: "bar"
}

Note: I believe in MongoDB, you would use an ObjectId type. In RavenDB, you would usually use a string, but an int is possible with a bit of minor adjustment. Other document databases may allow other types.

The part that doesn't work well in document databases is what you showed in your second example: A keeping a reference to B without storing B as part of the document. This pattern may work in ORMs like Entity Framework or NHibernate, but there it tends to be implemented via virtual properties and proxy classes. Those don't hold up well in a document database environment.

So if they are separate documents, instead of loading A and using a.RefB to get to B, you would just load A and B individually. For example, you might load A, and then use the BId to determine how to load B.
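
When several A documents are loaded at once, one way to do the second step is to collect the distinct BId values and fetch all referenced Bs in a single call. The sketch below assumes a hypothetical ReferenceLoader helper and a loadBs delegate that stands in for whatever query you use (in MongoDB that would presumably be a filter with the $in operator on the id field); it is not part of any driver.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class A
{
    public string Id { get; set; }
    public string AProp { get; set; }
    public string BId { get; set; } // reference by id only
}

public class B
{
    public string Id { get; set; }
    public string BProp { get; set; }
}

public static class ReferenceLoader
{
    // loadBs is an assumption: it represents one query that returns all Bs for the given ids.
    public static Dictionary<string, B> LoadReferencedBs(
        IEnumerable<A> items,
        Func<IReadOnlyCollection<string>, IEnumerable<B>> loadBs)
    {
        // Collect each referenced id once, skipping As that reference nothing.
        var ids = items
            .Select(a => a.BId)
            .Where(id => id != null)
            .Distinct()
            .ToList();

        // One call for all ids instead of one call per A.
        return loadBs(ids).ToDictionary(b => b.Id);
    }
}
```

The caller then looks up each B through the returned dictionary using a.BId. An id that no longer resolves is simply missing from the dictionary, so that case has to be handled explicitly by the caller.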

Of course, the question still comes down to whether to embed or to link. That is something you will have to figure out, as it can often be done either way. Typically one way works better than the other for a particular domain concern. But you typically don't do both.

Matt Johnson-Pint answered Sep 19 '22 00:09