I have a pretty simple $lookup
aggregation query like the following:
{'$lookup': {'from': 'edge', 'localField': 'gid', 'foreignField': 'to', 'as': 'from'}}
When I run this on a match with enough documents I get the following error:
Command failed with error 4568: 'Total size of documents in edge matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" } }, {} ] } } exceeds maximum document size' on server
All attempts to limit the number of documents fail. allowDiskUse: true
does nothing. Sending a cursor
in does nothing. Adding in a $limit
into the aggregation also fails.
How could this be?
Then I see the error again. Where did that $match
and $and
and $eq
come from? Is the aggregation pipeline behind the scenes farming out the $lookup
call to another aggregation, one it runs on its own that I have no ability to provide limits for or use cursors with??
What is going on here?
Here, the aggregate() function is used to perform aggregation it can have three operators stages, expression and accumulator. Stages: Each stage starts from stage operators which are: $match: It is used for filtering the documents can reduce the amount of documents that are given as input to the next stage.
To use the $merge stage, it must be the last stage in the pipeline. New in version 4.2. Writes the resulting documents of the aggregation pipeline to a collection. To use the $out stage, it must be the last stage in the pipeline.
mongosh Method For MongoDB API drivers, refer to the language-specific MongoDB driver documentation. Use allowDiskUse() to either allow or prohibit writing temporary files on disk when a pipeline stage exceeds the 100 megabyte limit.
For performing MongoDB Join two collections, you must use the $lookup operator. It is defined as a stage that executes a left outer join with another collection and aids in filtering data from joined documents. For example, if a user requires all grades from all students, then the below query can be written: Students.
As stated earlier in comment, the error occurs because when performing the $lookup
which by default produces a target "array" within the parent document from the results of the foreign collection, the total size of documents selected for that array causes the parent to exceed the 16MB BSON Limit.
The counter for this is to process with an $unwind
which immediately follows the $lookup
pipeline stage. This actually alters the behavior of $lookup
in such that instead of producing an array in the parent, the results are instead a "copy" of each parent for every document matched.
Pretty much just like regular usage of $unwind
, with the exception that instead of processing as a "separate" pipeline stage, the unwinding
action is actually added to the $lookup
pipeline operation itself. Ideally you also follow the $unwind
with a $match
condition, which also creates a matching
argument to also be added to the $lookup
. You can actually see this in the explain
output for the pipeline.
The topic is actually covered (briefly) in a section of Aggregation Pipeline Optimization in the core documentation:
$lookup + $unwind Coalescence
New in version 3.2.
When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
Best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. Done as briefly as possible to both break and work around the BSON Limit:
const MongoClient = require('mongodb').MongoClient; const uri = 'mongodb://localhost/test'; function data(data) { console.log(JSON.stringify(data, undefined, 2)) } (async function() { let db; try { db = await MongoClient.connect(uri); console.log('Cleaning....'); // Clean data await Promise.all( ["source","edge"].map(c => db.collection(c).remove() ) ); console.log('Inserting...') await db.collection('edge').insertMany( Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 })) ); await db.collection('source').insert({ _id: 1 }) console.log('Fattening up....'); await db.collection('edge').updateMany( {}, { $set: { data: "x".repeat(100000) } } ); // The full pipeline. Failing test uses only the $lookup stage let pipeline = [ { $lookup: { from: 'edge', localField: '_id', foreignField: 'gid', as: 'results' }}, { $unwind: '$results' }, { $match: { 'results._id': { $gte: 1, $lte: 5 } } }, { $project: { 'results.data': 0 } }, { $group: { _id: '$_id', results: { $push: '$results' } } } ]; // List and iterate each test case let tests = [ 'Failing.. Size exceeded...', 'Working.. Applied $unwind...', 'Explain output...' ]; for (let [idx, test] of Object.entries(tests)) { console.log(test); try { let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline), options = (( +idx === tests.length-1 ) ? { explain: true } : {}); await new Promise((end,error) => { let cursor = db.collection('source').aggregate(currpipe,options); for ( let [key, value] of Object.entries({ error, end, data }) ) cursor.on(key,value); }); } catch(e) { console.error(e); } } } catch(e) { console.error(e); } finally { db.close(); } })();
After inserting some initial data, the listing will attempt to run an aggregate merely consisting of $lookup
which will fail with the following error:
{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size
Which is basically telling you the BSON limit was exceeded on retrieval.
By contrast the next attempt adds the $unwind
and $match
pipeline stages
The Explain output:
{ "$lookup": { "from": "edge", "as": "results", "localField": "_id", "foreignField": "gid", "unwinding": { // $unwind now is unwinding "preserveNullAndEmptyArrays": false }, "matching": { // $match now is matching "$and": [ // and actually executed against { // the foreign collection "_id": { "$gte": 1 } }, { "_id": { "$lte": 5 } } ] } } }, // $unwind and $match stages removed { "$project": { "results": { "data": false } } }, { "$group": { "_id": "$_id", "results": { "$push": "$results" } } }
And that result of course succeeds, because as the results are no longer being placed into the parent document then the BSON limit cannot be exceeded.
This really just happens as a result of adding $unwind
only, but the $match
is added for example to show that this is also added into the $lookup
stage and that the overall effect is to "limit" the results returned in an effective way, since it's all done in that $lookup
operation and no other results other than those matching are actually returned.
By constructing in this way you can query for "referenced data" that would exceed the BSON limit and then if you want $group
the results back into an array format, once they have been effectively filtered by the "hidden query" that is actually being performed by $lookup
.
As all the content above notes, the BSON Limit is a "hard" limit that you cannot breach and this is generally why the $unwind
is necessary as an interim step. There is however the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind
where it cannot preserve the content. Also even preserveNulAndEmptyArrays
would negate the "coalescence" and still leave the intact array, causing the same BSON Limit problem.
MongoDB 3.6 adds new syntax to $lookup
that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not also breach the limit it is possible to put conditions in that pipeline which returns the array "intact", and possibly with no matches as would be indicative of a "LEFT JOIN".
The new expression would then be:
{ "$lookup": { "from": "edge", "let": { "gid": "$gid" }, "pipeline": [ { "$match": { "_id": { "$gte": 1, "$lte": 5 }, "$expr": { "$eq": [ "$$gid", "$to" ] } }} ], "as": "from" }}
In fact this would be basically what MongoDB is doing "under the covers" with the previous syntax since 3.6 uses $expr
"internally" in order to construct the statement. The difference of course is there is no "unwinding"
option present in how the $lookup
actually gets executed.
If no documents are actually produced as a result of the "pipeline"
expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" actually does and would be the normal behavior of $lookup
without any other options.
However the output array to MUST NOT cause the document where it is being created to exceed the BSON Limit. So it really is up to you to ensure that any "matching" content by the conditions stays under this limit or the same error will persist, unless of course you actually use $unwind
to effect the "INNER JOIN".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With