MongoDB aggregation pipeline size and speed issue

I'm trying to use a MongoDB aggregation query to join ($lookup) two collections and then count the distinct values across the joined arrays. Note: I don't necessarily know what fields (keys) are in the metaDataMap object, and I don't want to count or include fields that may or may not exist in the map. That's why the aggregation query looks the way it does.

My two collections look like this. events:

{
    "_id" : "1",
    "name" : "event1",
    "objectsIds" : [ "1", "2", "3" ]
}

objects:

{
"_id" : "1",
"name" : "object1",
"metaDataMap" : { 
                     "SOURCE" : ["ABC", "DEF"],
                     "DESTINATION" : ["XYZ", "PDQ"],
                     "TYPE" : []
                }
},
{
"_id" : "2",
"name" : "object2",
"metaDataMap" : { 
                     "SOURCE" : ["RST", "LNE"],
                     "TYPE" : ["text"]
                }
},
{
"_id" : "3",
"name" : "object3",
"metaDataMap" : { 
                     "SOURCE" : ["NOP"],
                     "DESTINATION" : ["PHI", "NYC"],
                     "TYPE" : ["video"]
                }
}

My results are:

{ "_id" : "SOURCE", "count" : 5 }
{ "_id" : "DESTINATION", "count" : 4 }
{ "_id" : "TYPE", "count" : 2 }

What I have so far is this:

db.events.aggregate([
{$match: {"_id" : id}}

,{$lookup: {"from" : "objects",
        "localField" : "objectsIds",
        "foreignField" : "_id",
        "as" : "objectResults"}}

,{$unwind: "$objectResults"} //Line 1
,{$project: {x: "$objectResults.metaDataMap"}} //Line 2


,{$unwind: "$x"}
,{$project: {"_id":0}}

,{$project: {x: {$objectToArray: "$x"}}}
,{$unwind: "$x"}

,{$group: {_id: "$x.k", tmp: {$push: "$x.v"}}}

,{$addFields: {tmp: {$reduce:{
input: "$tmp",
initialValue:[],
in:{$concatArrays: [ "$$value", "$$this"]}
    }}
}}

,{$unwind: "$tmp"}
,{$group: {_id: "$_id", uniqueVals: {$addToSet: "$tmp"}}}

,{$addFields: {count: {"$size":"$uniqueVals"}}}
,{$project: {_id: "$_id", count: "$count"}}
]);
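For reference, the $objectToArray stage is what lets the pipeline handle unknown keys: it turns each metaDataMap document into an array of {k, v} pairs that can be unwound. Roughly, in Python terms (illustration only):

```python
# Rough Python analogue of MongoDB's $objectToArray: a document
# becomes a list of {"k": key, "v": value} pairs.
def object_to_array(doc):
    return [{"k": k, "v": v} for k, v in doc.items()]

meta = {"SOURCE": ["ABC", "DEF"], "TYPE": []}
print(object_to_array(meta))
# [{'k': 'SOURCE', 'v': ['ABC', 'DEF']}, {'k': 'TYPE', 'v': []}]
```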

My issue is where I marked Line 1 & 2. The above works, but takes around 50 seconds when there are 25,000 values in the metaDataMap array fields (objectResults.metaDataMap), for example 25,000 values in object 1's metaDataMap SOURCE array. That's way too slow. My other, faster approach was to replace Line 1 & 2 with:

 ,{$project: {x: "$objectResults.metaDataMap"}} //Line 1
 ,{$unwind: "$x"} //Line 2

This is way faster (under 3 seconds), but can only be run on datasets with roughly 10,000 items or fewer. Anything higher and I get an error saying "exceeds maximum document size" (MongoDB's 16 MB BSON document limit).

Please help!

Deckard asked Nov 07 '22
1 Answer

If you're able to alter your schema design in the objects collection to include a parent_id field, you can immediately remove the first four stages of your pipeline (the initial $match, $lookup, $unwind, and $project). This makes the concern about Line 1 and Line 2 disappear.

For example, a document in the objects collection would look like:

{
  "_id": "1",
  "name": "object1",
  "metaDataMap": {
    "SOURCE": [
      "ABC",
      "DEF"
    ],
    "DESTINATION": [
      "XYZ",
      "PDQ"
    ],
    "TYPE": [ ]
  },
  "parent_id": "1"
}

Thus you don't need the expensive $lookup and $unwind stages. The first four stages can then be replaced with a single $match:

{$match: {parent_id: id}}

Based on this idea, I did further optimization of the pipeline, which resulted in:

db.objects.aggregate([
     {$match: {parent_id: id}}
    ,{$project: {metaDataMap: {$filter: {input: {$objectToArray: '$metaDataMap'}, cond: {$ne: [[], '$$this.v']}}}}}
    ,{$unwind: '$metaDataMap'}
    ,{$unwind: '$metaDataMap.v'}
    ,{$group: {_id: '$metaDataMap.k', val: {$addToSet: '$metaDataMap.v'}}}
    ,{$project: {count: {$size: '$val'}}}
])
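To make the pipeline's behavior concrete, here is a rough plain-Python analogue (illustration only, not part of the MongoDB code) of what these stages compute on the three sample objects:

```python
# Python analogue of the optimized pipeline: drop empty-array keys
# ($filter), flatten key/value pairs (the two $unwind stages), collect
# distinct values per key ($group with $addToSet), then count ($size).
objects = [
    {"metaDataMap": {"SOURCE": ["ABC", "DEF"], "DESTINATION": ["XYZ", "PDQ"], "TYPE": []}},
    {"metaDataMap": {"SOURCE": ["RST", "LNE"], "TYPE": ["text"]}},
    {"metaDataMap": {"SOURCE": ["NOP"], "DESTINATION": ["PHI", "NYC"], "TYPE": ["video"]}},
]

groups = {}
for obj in objects:
    # $project + $filter: keep only entries whose value array is non-empty
    pairs = [(k, v) for k, v in obj["metaDataMap"].items() if v != []]
    for k, v in pairs:                          # $unwind '$metaDataMap'
        for val in v:                           # $unwind '$metaDataMap.v'
            groups.setdefault(k, set()).add(val)  # $group / $addToSet

result = {k: len(vals) for k, vals in groups.items()}
print(result)  # {'SOURCE': 5, 'DESTINATION': 4, 'TYPE': 2}
```

Note that an empty array like TYPE in object1 is filtered out per document, but TYPE still gets counted via object2 and object3.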

This will output:

{ "_id": "TYPE", "count": 2 }
{ "_id": "DESTINATION", "count": 4 }
{ "_id": "SOURCE", "count": 5 }
kevinadi answered Nov 15 '22