How can $facet improve $lookup performance

Issue

I recently attended a tech meetup and showed some of my code to a much more experienced developer. He commented that I would run into issues with my pipeline due to $lookup and that I should consider using $facet to fix this.

I don't remember what issue he said I would run into nor how $facet could help fix it. I think it has something to do with the 16mb document limit but this can be resolved by using $unwind after $lookup.

My code (Node.js)

I have a collection of Post documents. Some posts are parent posts, and other posts are comments. A post that is a comment is identified by the fact that its parent property is NOT null.

My goal is to return an array of the most recent parent posts and attach to each one an int property that is the number of comments it has.

Here is my Post mongoose schema

const postSchema = new mongoose.Schema({
    title: { type: String, required: true, trim: true },
    body: { type: String, required: true, trim: true },
    category: { type: String, required: true, trim: true, lowercase: true },
    timestamp: { type: Date, required: true, default: Date.now },
    parent: { type: mongoose.Schema.Types.ObjectId, ref: 'Post', default: null },
});
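
To illustrate the shape of the data (hypothetical values, shown in shell-style notation), a parent post and one of its comments look roughly like this:

// A parent post has parent: null, while a comment stores the parent
// post's _id in its parent property.
{
    "_id": ObjectId("5af8f2d4e1a2b3c4d5e6f7a8"),
    "title": "A parent post",
    "body": "Some content",
    "category": "news",
    "timestamp": ISODate("2018-05-13T14:00:00Z"),
    "parent": null
}
{
    "_id": ObjectId("5af8f2d4e1a2b3c4d5e6f7a9"),
    "title": "A reply",
    "body": "A comment on the post above",
    "category": "news",
    "timestamp": ISODate("2018-05-13T14:05:00Z"),
    "parent": ObjectId("5af8f2d4e1a2b3c4d5e6f7a8")
}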

Here is my pipeline

const pipeline = [
    { $match: { category: query.category } },
    { $sort: { timestamp: -1 } },
    { $skip: (query.page - 1) * query.count },
    { $limit: query.count },
    {
        $lookup: {
            from: 'posts',
            localField: '_id',
            foreignField: 'parent',
            as: 'comments',
        },
    },
    {
        $addFields: {
            comments: { $size: '$comments' },
            id: '$_id',
        },
    },
    { $project: { _id: 0, __v: 0 } },
];
Asked May 13 '18 by Thiago P

1 Answer

In short, it cannot. But if anyone told you this, then it deserves an explanation to clear up why such a concept is incorrect.

Why NOT $facet

As commented, $facet cannot do anything for you here, and the suggestion was likely a miscommunication about what your query is meant to be doing. If anything, the $facet pipeline stage would cause more problems with the BSON Limit, due to the plain fact that the only output of the $facet pipeline stage is a "single document". This means that unless you are actually using it for its intended purpose of "summary results", you will almost certainly breach this limit in real-world conditions.

The biggest reason why it simply does not apply is that your $lookup is pulling in data from a different collection. The $facet stage only applies to the "same collection", so you cannot have output from one collection in one "facet" and a different collection in another. There can only be "pipelines" defined over the same collection on which the .aggregate() is being performed.
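
To make that concrete, here is a minimal sketch (over the same posts collection, with illustrative facet names) of what $facet actually does: every sub-pipeline receives the same input documents, and the whole stage emits exactly one document with one array field per facet.

// Each facet below is a pipeline over the SAME posts input documents.
const facetExample = [
    { "$facet": {
        "latestPosts": [
            { "$sort": { "timestamp": -1 } },
            { "$limit": 10 }
        ],
        "postsPerCategory": [
            { "$group": { "_id": "$category", "count": { "$sum": 1 } } }
        ]
    }}
];

// Output shape: [ { "latestPosts": [ ... ], "postsPerCategory": [ ... ] } ]
// A large "latestPosts" array just moves the BSON Limit problem into that
// single output document, and there is no way to point a facet at another
// collection.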

$lookup is STILL what you want

The point about the BSON Size limit, however, is perfectly valid, since the main failing in your current aggregation pipeline is using the $size operator on the returned array. The "array" is actually the problem here: since it is "unbound", it has the "potential" to pull in enough documents from the related collection that the parent document containing the array in the output exceeds the BSON limit.

So there are two basic approaches to this which you can use in order to simply obtain the "size" without actually needing to create an array of the "whole" related documents.

MongoDB 3.6 and greater

Here you would use $lookup with the "sub-pipeline" expression syntax introduced in this version to simply return the "reduced count" without actually returning any documents:

const pipeline = [
    { "$match": { "category": query.category } },
    { "$sort": { "timestamp": -1 } },
    { "$skip": (query.page - 1) * query.count },
    { "$limit": query.count },
    { "$lookup": {
      "from": "posts",
      "let": { "id": "$_id" },
      "pipeline": [
        { "$match": {
          "$expr": { "$eq": [ "$$id", "$parent" ] }
        }},
        { "$count": "count" }
      ],
      "as": "comments",
    }},
    { "$addFields": {
        "comments": {
          "$ifNull": [ { "$arrayElemAt": [ "$comments.count", 0 ] }, 0 ]
        },
        "id": "$_id"
    }}
];

Very simply put, the new "sub-pipeline" returns into the target "array" (which is always an array) only the output of the pipeline expression. Here we not only $match on the local and foreign key values (which is actually what the other $lookup form now does internally), but we continue the pipeline with the $count stage, which is itself actually a synonym for:

{ "$group": { "_id": null, "count": { "$sum": 1 } } },
{ "$project": { "_id": 0, "count": 1 } }

Point being that you only ever receive "one" document at most in the array response, which we can then easily transform into a singular value via $arrayElemAt, using $ifNull to obtain a count of 0 in case there were no matches in the foreign collection.
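
As a rough illustration with hypothetical values, this is all that expression is doing:

// "$comments.count" resolves to the array of count values, e.g. [ 5 ], so
// $arrayElemAt picks element 0 and yields 5. With no matching comments the
// sub-pipeline returns [], element 0 is missing, and $ifNull substitutes 0.
const commentsCount = {
    "$ifNull": [ { "$arrayElemAt": [ "$comments.count", 0 ] }, 0 ]
};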

Earlier Versions

For versions earlier than MongoDB 3.6, the general idea is to $unwind directly after the $lookup. This actually has a special action which is described under $lookup + $unwind Coalescence in the wider manual section on Aggregation Pipeline Optimization. Personally I view these more as a "hindrance" than an "optimization", as you really should be able to "express what you mean" instead of something doing things for you "behind your back". But the basics go like this:

const pipeline = [
    { "$match": { "category": query.category } },
    { "$sort": { "timestamp": -1 } },
    { "$skip": (query.page - 1) * query.count },
    { "$limit": query.count },
    { "$lookup": {
      "from": "posts",
      "localField": "_id",
      "foreignField": "parent",
      "as": "comments"
    }},
    { "$unwind": "$comments" },
    { "$group": {
      "_id": "$_id",
      "otherField": { "$first": "$otherField" },
      "comments": { "$sum": 1 }
    }}
];

The important part here is what actually happens to both the $lookup and $unwind stages, as can be viewed using explain() to view the parsed pipeline as actually expressed by the server:

{
    "$lookup" : {
        "from" : "posts",
        "as" : "comments",
        "localField" : "_id",
        "foreignField" : "parent",
        "unwinding" : {
            "preserveNullAndEmptyArrays" : false
        }
    }
}

That unwinding essentially gets "rolled into" the $lookup and the $unwind itself "disappears". This is because the combination gets translated in this "special way", which results in the $lookup emitting the "unwound" results directly instead of targeting an array. This is basically done so that if an "array" never actually gets created, then the BSON Limit can never be breached.
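
If you want to see this for yourself, a quick sketch (assuming the pipeline variable from above) is to ask the server to explain the aggregation rather than run it:

// Mongoose exposes explain() on the Aggregate object; the shell equivalent
// is db.posts.aggregate(pipeline, { explain: true }).
const explained = await Post.aggregate(pipeline).explain();
console.log(JSON.stringify(explained, null, 2));
// Look for the $lookup stage carrying the "unwinding" block shown above.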

The rest is of course pretty simple in that you just use $group in order to "group back" to the original document. You can use $first as the accumulator in order to keep any fields of the document you want in the response and simply $sum to count the foreign data returned.

Since this is mongoose, I have already outlined a process for "automating" building all the fields to include with $first as part of my answer on Querying after populate in Mongoose which shows how to examine the "schema" in order to obtain that information.
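
As a rough sketch of that idea (not the exact code from the linked answer), you can read the field names from Post.schema.paths and build the $first accumulators from them:

// Build { field: { $first: "$field" } } for every schema path except the
// ones we do not want carried through the $group stage.
const groupFields = Object.keys(Post.schema.paths)
    .filter(name => !['_id', '__v'].includes(name))
    .reduce((acc, name) => ({ ...acc, [name]: { "$first": `$${name}` } }), {});

const groupStage = {
    "$group": {
        "_id": "$_id",
        ...groupFields,             // title, body, category, timestamp, parent
        "comments": { "$sum": 1 }   // the count accumulator from above
    }
};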

Another "wrinkle" to this is the $unwind negates the "LEFT JOIN" inherent to $lookup since where there is no matches to the parent content then that "parent document" is removed from results. I'm not quite certain on this as of writing ( and should look it up later ), but the preserveNullAndEmptyArrays option did have a restriction in that it could not apply in this form of "Coalescence", however this is not the case from at least MongoDB 3.6:

const pipeline = [
    { "$match": { "category": query.category } },
    { "$sort": { "timestamp": -1 } },
    { "$skip": (query.page - 1) * query.count },
    { "$limit": query.count },
    { "$lookup": {
      "from": "posts",
      "localField": "_id",
      "foreignField": "parent",
      "as": "comments"
    }},
    { "$unwind": { "path": "$comments", "preserveNullAndEmptyArrays": true } },
    { "$group": {
      "_id": "$_id",
      "otherField": { "$first": "$otherField" },
      "comments": {
        "$sum": {
          "$cond": {
            "if": { "$eq": [ "$comments", null ]  },
            "then": 0,
            "else": 1
          }
        }
      }
    }}
];

Since I cannot actually confirm that this works properly in anything other than MongoDB 3.6, it's kind of pointless, because with the newer version you should use the different form of $lookup anyway. I know that there was at least an initial problem in MongoDB 3.2 where preserveNullAndEmptyArrays cancelled out the "Coalescence", so the $lookup still returned output as an "array" and the array was only "unwound" after that stage, which defeats the purpose of doing this to avoid the BSON limit.


Do it in Code

All of that said, ultimately you are just looking for "counts" to be added to your results for the "related" comments. As long as you are not pulling in pages with "hundreds of items", your $limit condition should keep the result set small enough to simply fire off count() queries for the matching document counts on each key without "too much" overhead making it unreasonable:

// Get documents
let posts = await Post.find({ "category": query.category })
    .sort({ "timestamp": -1 })
    .skip((query.page - 1) * query.count)
    .limit(query.count)
    .lean().exec();

// Map counts to each document. Comments live in the same Post collection,
// so count the posts whose parent is each returned post's _id.
posts = (await Promise.all(
  posts.map(post => Post.count({ "parent": post._id }))
)).map((comments, i) => ({ ...posts[i], comments }));

The "trade-off" here is that whilst run "parallel" execution of all these count() queries means additional requests to the server, the overhead of each query in itself is actually really low. Obtaining the "cursor count" of the query result is a lot more efficient than using something like the $count aggregation pipeline stage shown above.

That puts a load on the database connections when executing, but it does not carry the same "processing load", and of course you're only ever looking at "counts": no documents are returned over the wire or even "fetched" from the collection in processing the cursor results.

So that last approach is basically an "optimization" of the mongoose populate() process, where we don't actually ask for the "documents" and simply get a count for each query. Technically, populate() would use "one" query here with $in for all the documents in the prior result. But that won't work here because you want the total per "parent", which is essentially an aggregation in a single query and response. Hence why "multiple requests" are issued here.
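
For reference, the "single query and response" aggregation that sentence alludes to would look roughly like this (a sketch reusing the posts array fetched above):

// One round trip: group the comment documents by parent and count them,
// then map the counts back onto the already-fetched parent posts.
const ids = posts.map(post => post._id);
const counts = await Post.aggregate([
    { "$match": { "parent": { "$in": ids } } },
    { "$group": { "_id": "$parent", "count": { "$sum": 1 } } }
]);
const countMap = new Map(counts.map(c => [String(c._id), c.count]));
posts = posts.map(post => ({ ...post, comments: countMap.get(String(post._id)) || 0 }));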

Summary

So in order to avoid the BSON Limit problems, what you are really looking for is any of those techniques that avoid returning the "array" of related documents from the $lookup pipeline stage used to "join", either by obtaining the "reduced data" or by a "cursor count" technique.

There is a bit more "depth" on the BSON Size Limit and handling at:

Aggregate $lookup Total size of documents in matching pipeline exceeds maximum document size on this very site. Note that the same techniques demonstrated there to cause an error can also apply to the $facet stage, since the 16MB limitation is a constant for anything that is a "document". And just about "everything" in MongoDB is a BSON document, so working within the limits is extremely important.

NOTE: Purely from a "performance" perspective, the biggest problem outside of the potential BSON Size Limit breach inherent in your current query is actually the $skip and $limit processing. If what you are actually implementing is more of a "Load more results..." type of functionality, then something like Implementing pagination in mongodb, where you use a "range" to start the next "page" selection by excluding previous results, is far more performance oriented than $skip and $limit.

Paging with $skip and $limit should only be used where you really have no other option, i.e. "numbered paging" where you can jump to any numbered page. And even then, it's still far better to instead "cache" results into pre-defined sets.
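
A minimal sketch of that "range" approach (assuming a lastSeenTimestamp remembered from the previous page; real code would also tie-break on _id for equal timestamps):

const pipeline = [
    { "$match": {
        "category": query.category,
        "timestamp": { "$lt": lastSeenTimestamp }   // everything older than the last item shown
    }},
    { "$sort": { "timestamp": -1 } },
    { "$limit": query.count },
    // ... then the same $lookup / counting stages as shown earlier
];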

But that really is a "whole other question" than the essential question here about the BSON Size Limit.

Answered Nov 04 '22 by Neil Lunn