I'm trying to use a MongoDB aggregation query to join ($lookup) two collections and then distinct-count all the unique values in the joined metaDataMap fields. Note: I don't necessarily know what fields (keys) are in the metaDataMap, and I don't want to count or include fields that may or may not exist in the map. That's why the aggregation query looks the way it does.
So my two collections look like this:

events:
{
    "_id" : "1",
    "name" : "event1",
    "objectsIds" : [ "1", "2", "3" ]
}
objects:
{
    "_id" : "1",
    "name" : "object1",
    "metaDataMap" : {
        "SOURCE" : ["ABC", "DEF"],
        "DESTINATION" : ["XYZ", "PDQ"],
        "TYPE" : []
    }
},
{
    "_id" : "2",
    "name" : "object2",
    "metaDataMap" : {
        "SOURCE" : ["RST", "LNE"],
        "TYPE" : ["text"]
    }
},
{
    "_id" : "3",
    "name" : "object3",
    "metaDataMap" : {
        "SOURCE" : ["NOP"],
        "DESTINATION" : ["PHI", "NYC"],
        "TYPE" : ["video"]
    }
}
My results are:
{ "_id" : "SOURCE", "count" : 5 }
{ "_id" : "DESTINATION", "count" : 4 }
{ "_id" : "TYPE", "count" : 2 }
What I have so far is this:
db.events.aggregate([
{$match: {"_id" : id}}
,{$lookup: {"from" : "objects",
"localField" : "objectsIds",
"foreignField" : "_id",
"as" : "objectResults"}}
,{$unwind: "$objectResults"} //Line 1
,{$project: {x: "$objectResults.metaDataMap"}} //Line 2
,{$unwind: "$x"}
,{$project: {"_id":0}}
,{$project: {x: {$objectToArray: "$x"}}}
,{$unwind: "$x"}
,{$group: {_id: "$x.k", tmp: {$push: "$x.v"}}}
,{$addFields: {tmp: {$reduce:{
input: "$tmp",
initialValue:[],
in:{$concatArrays: [ "$$value", "$$this"]}
}}
}}
,{$unwind: "$tmp"}
,{$group: {_id: "$_id", uniqueVals: {$addToSet: "$tmp"}}}
,{$addFields: {count: {"$size":"$uniqueVals"}}}
,{$project: {_id: "$_id", count: "$count"}}
]);
My issue is where I marked Line 1 & 2. The above works, but takes around 50 seconds for 25,000 values in the metaDataMap fields (objectResults.metaDataMap), for example having 25,000 values in object1's metaDataMap SOURCE array. That's way too slow. My other, faster way to do it was to replace Line 1 & 2 with:
,{$project: {x: "$objectResults.metaDataMap"}} //Line 1
,{$unwind: "$x"} //Line 2
This is way faster (under 3 seconds) but can only be run on datasets that have ~10,000 items or less. Anything higher and I get an error saying "exceeds maximum document size", presumably because the joined objectResults array pushes the document past the 16 MB BSON limit.
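For reference, on a dataset that still fits under the limit I can check how close the joined document gets with Object.bsonsize in the mongo shell. This is only a diagnostic sketch (the variable name is illustrative, and it fails once the $lookup itself exceeds the limit):

var joined = db.events.aggregate([
    // Same first two stages as above: pick the event and join its objects.
    {$match: {"_id" : id}},
    {$lookup: {"from" : "objects", "localField" : "objectsIds", "foreignField" : "_id", "as" : "objectResults"}}
]).next();
// Size in bytes of the single joined document; compare against the 16 MB BSON limit.
print(Object.bsonsize(joined));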
Please help!
If you're able to alter your schema design on the objects collection to include a parent_id field, you can immediately remove the first 4 stages of your pipeline (the first $match, $lookup, $unwind, and $project). This will make the concern about Line 1 and Line 2 disappear.

For example, a document in the objects collection would look like:
{
"_id": "1",
"name": "object1",
"metaDataMap": {
"SOURCE": [
"ABC",
"DEF"
],
"DESTINATION": [
"XYZ",
"PDQ"
],
"TYPE": [ ]
},
"parent_id": "1"
}
Thus you don't need the expensive $lookup and $unwind. The first 4 stages can then be replaced with:
{$match: {parent_id: id}}
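If you go this route, the existing data needs a one-off backfill. A rough sketch in the mongo shell, assuming each object belongs to a single event and the events documents keep their objectsIds arrays as shown in the question:

// One-off backfill sketch: stamp each object with the _id of the event that references it.
db.events.find({}, {objectsIds: 1}).forEach(function (event) {
    db.objects.updateMany(
        {_id: {$in: event.objectsIds}},
        {$set: {parent_id: event._id}}
    );
});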
Based on this idea, I did further optimization of the pipeline, which resulted in:
db.objects.aggregate([
{$match: {parent_id: id}}
,{$project: {metaDataMap: {$filter: {input: {$objectToArray: '$metaDataMap'}, cond: {$ne: [[], '$$this.v']}}}}}
,{$unwind: '$metaDataMap'}
,{$unwind: '$metaDataMap.v'}
,{$group: {_id: '$metaDataMap.k', val: {$addToSet: '$metaDataMap.v'}}}
,{$project: {count: {$size: '$val'}}}
])
This will output:
{ "_id": "TYPE", "count": 2 }
{ "_id": "DESTINATION", "count": 4 }
{ "_id": "SOURCE", "count": 5 }