Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB advanced aggregation

I'm a total newbie to MongoDB. I work on a privat project for my golf club to analyze the round.

I use meteorJS for the Application and tried some aggregation on the command line. But I'm not sure if I even have the right point to the task

A sample document:

{
    "_id" : "2KasYR3ytsaX8YuoT",
    "course" : {
        "id" : "rHmYJBhRtSt38m68s",
        "name" : "CourseXYZ"
    },
    "player" : {
        "id" : "tdaYaSvXJueDq4oTN",
        "firstname" : "Just",
        "lastname" : "aPlayer"
    },
    "event" : "Training Day",
    "tees" : [
        {
            "tee" : 1,
            "par" : 4,
            "fairway" : "straight",
            "greenInRegulation" : true,
            "putts" : 3,
            "strokes" : 5
        },
        {
            "tee" : 2,
            "par" : 5,
            "fairway" : "right",
            "greenInRegulation" : true,
            "putts" : 2,
            "strokes" : 5
        },
        {
            "tee" : 3,
            "par" : 5,
            "fairway" : "right",
            "greenInRegulation" : false,
            "shotType": "bunker",
            "putts" : 2,
            "strokes" : 5
        }
    ]
}

My attempt so far:

db.analysis.aggregate([
   {$unwind: "$tees"}, 
   {$group: { 
       _id:"$player.id",
       strokes: {$sum: "$tees.strokes"},
       par: {$sum: "$tees.par"},
       putts: {$sum: "$tees.putts"},
       teesPlayed: {$sum:1}
   }}
])

And what I want for a result

{ 
    "_id" : "tdaYaSvXJueDq4oTN", 
    "strokes" : 15, 
    "par" : 14, 
    "putts" : 7, 
    "teesPlayed" : 3 
    // here comes what I want to add:
    "fairway.straight": 1 // where tees.fairway equals "straight"
    "fairway.right": 2 // where tees.fraiway equals "right" (etc.)
    "shotType.bunker": 1 // where shotType equals "bunker" etc.
}
like image 878
yaf Avatar asked Feb 07 '23 16:02

yaf


1 Answers

There are a few ways of approaching this depending on your overall needs and which MongoDB server version you have available as a target for your project.

Whilst "meteor" installations and default project setups do not "bundle" a MongoDB 3.2 instance, there is no need why your project cannot use such an instance as an external target. If it's a new project to get off the ground, then I would highly recommend working against the latest version available. And maybe even possibly against latest development releases, depending on your own targeted release cycle. Work with what is most fresh, and your application will be too.

For that reason, we start with the latest at the top of the list.

MongoDB 3.2 way - Fast

The big feature in MongoDB 3.2 that makes it really stand out here in terms of performance is a change in how $sum operates. Previously just as an accumulator operator for $group this would work on singular numeric values to produce a total.

The big improvement is hidden within the $project stage usage which is added, where $sum can be directly applied to an array of values. i.e { "$sum": [1,2,3] } results in 6. So now you can "nest" the operations with anything that produces an array of values from a source. Most notably here is $map:

db.analysis.aggregate([
    { "$group": {
        "_id": "$player.id",
        "strokes": {
            "$sum": { 
                "$sum": {
                    "$map": {
                        "input": "$tees",
                        "as": "tee",
                        "in": "$$tee.strokes"
                    }
                }
            }
        },
        "par": {
            "$sum": {
                "$sum": {
                    "$map": {
                        "input": "$tees",
                        "as": "tee",
                        "in": "$$tee.par"
                    }
                }
            }
        },
        "putts": {
            "$sum": {
                "$sum": {
                    "$map": {
                        "input": "$tees",
                        "as": "tee",
                        "in": "$$tee.putts"
                    }
                }
            }
         },
        "teesPlayed": { "$sum": { "$size": "$tees" } },
        "shotsRight": {
            "$sum": {
                "$size": {
                    "$filter": {
                        "input": "$tees",
                        "as": "tee",
                        "cond": { "$eq": [ "$$tee.fairway", "right" ] }
                    }
                }
            }
        },
        "shotsStraight": {
            "$sum": {
                "$size": {
                    "$filter": {
                        "input": "$tees",
                        "as": "tee",
                        "cond": { "$eq": [ "$$tee.fairway", "straight" ] }
                    }
                }
            }
        },
        "bunkerShot": {
            "$sum": {
                "$size": {
                    "$filter": {
                        "input": "$tees",
                        "as": "tee",
                        "cond": { "$eq": [ "$$tee.shotType", "bunker" ] }
                    }
                }
            }
        }
    }}
])

So here each field is split out by either doing the double $sum trick on the single field values from the array items, or in contrast the arrays are being processed with $filter to just restrict to matching items and processed for lenght of matches with $size, for the result fields that rather want "counts".

Though this looks long winded in pipeline construction it will yield the fasted results. And though you need to specify all of the keys to result with the associated logic, there is nothing stopping "generation" of the data structure necessary for the pipeline as the result of other queries on the data set.

The other Aggregate Way - A bit slower

Of course not every project can practically use the latest version of things. So before a MongoDB 3.2 release that introduced some of the operators used above, the only real practical way to work with array data and conditionally work with different elements and sums was to process first with $unwind.

So essentially we start with the query you began to construct, but then add in the handling for the different fields:

db.analysis.aggregate([
    { "$unwind": "$tees" },
    { "$group": {
        "_id": "$player.id",
        "strokes": { "$sum": "$tees.strokes" },
        "par": { "$sum": "$tees.par" },
        "putts": { "$sum": "$tees.putts" },
        "teedsPlayed": { "$sum": 1 },
        "shotsRight": {
            "$sum": {
                "$cond": [
                    { "$eq": [ "$tees.fairway", "right" ] },
                    1,
                    0
                ]
            }
        },
        "shotsStraight": {
            "$sum": {
                "$cond": [
                    { "$eq": [ "$tees.fairway", "straight" ] },
                    1,
                    0
                ]
            }
        },
        "bunkerShot": {
            "$sum": {
                "$cond": [
                    { "$eq": [ "$tees.shotType", "bunker" ] },
                    1,
                    0
                ]
            }
        }
    }}
])

So you should note that there is still "some" similarity to the first listing, in that where the $filter statements all have some logic within there "cond" argument, that logic is rather transposed to the $cond operator here.

As a "ternary" operator ( if/then/else) , it's job it is to evaluate a logical condition (if) and either return the next argument where that condition was true (then) or otherwise return the last argument where it is false (else). In this case either 1 or 0 depending on whether the tested condition matched. This gives the "counts" to $sum as is required.

In either statement, the produced results come out like this:

{
        "_id" : "tdaYaSvXJueDq4oTN",
        "strokes" : 15,
        "par" : 14,
        "putts" : 7,
        "teesPlayed" : 3,
        "shotsRight" : 2,
        "shotsStraight" : 1,
        "bunkerShot" : 1
}

Since this is an aggregate statement with $group, then one rule is that the "keys" ( apart from needing to be specified in the constructed statement ) must be in the "top-level" of the structure. So no "nested" structures are allowed within a $group, hence the whole names for each key.

If you really must transform, then you can by adding a $project stage following the $group in each example:

{ "$project": {
    "strokes": 1,
    "par": 1,
    "putts": 1,
    "teesPlayed": 1,
    "fairway": {
        "straight": "$shotsStraight",
        "right": "$shotsRight"
    },
    "shotType": {
        "bunker": "$bunkerShot"
    }
}}

So a bit of "re-shaping" can be done, but of course all the names and structure must be specified, though again you could in theory just generate this all in code. It is just a data structure after all.

The bottom line here is that $unwind adds cost, and quite a lot of cost. It is basically going to add a copy of each document in the pipeline for processing "per" every array element contained in each document. So not is there only the cost of processing all of those produced things, but also a cost of "producing" them in the first place.

MapReduce - Slower still, but more flexible on the keys

And finally as an approach

db.analysis.mapReduce(
    function() {

        var data = { "strokes": 0 ,"par": 0, "putts": 0, "teesPlayed": 0, "fairway": {} };

        this.tees.forEach(function(tee) {
            // Increment common values
            data.strokes += tee.strokes;
            data.par += tee.par;
            data.putts += tee.putts;
            data.teesPlayed++;

            // Do dynamic keys
            if (!data.fairway.hasOwnProperty(tee.fairway))
                data.fairway[tee.fairway] = 0;
            data.fairway[tee.fairway]++;

            if (tee.hasOwnProperty('shotType')) {
                if (!data.hasOwnProperty('shotType'))
                    data.shotType = {};
                if (!data.shotType.hasOwnProperty(tee.shotType))
                    data.shotType[tee.shotType] = 0;
                data.shotType[tee.shotType]++
            }

        });

        emit(this.player.id,data);
    },
    function(key,values) {
        var data = { "strokes": 0 ,"par": 0, "putts": 0, "teesPlayed": 0, "fairway": {} };

        values.forEach(function(value) {
            // Common keys
            data.strokes += value.strokes;
            data.par += value.par;
            data.putts += value.putts;
            data.teesPlayed += value.teesPlayed;

            Object.keys(value.fairway).forEach(function(fairway) {
                if (!data.fairway.hasOwnProperty(fairway))
                    data.fairway[fairway] = 0;
                data.fairway[fairway] += value.fairway[fairway];
            });

            if (value.hasOwnProperty('shotType')) {
                if (!data.hasOwnProperty('shotType'))
                    data.shotType = {};
                Object.keys(value.shotType).forEach(function(shotType) {
                    if (!data.shotType.hasOwnProperty(shotType))
                        data.shotType[shotType] = 0;
                    data.shotType[shotType] += value.shotType[shotType];
                });
            }
        });

        return data;

    },
    { "out": { "inline": 1 } }
)

And the output from this can be done immediately with the nested structure, but of course in the very mapReduce output form of "key/value", being that "key" is the grouping _id and "value" contains all the output:

            {
                    "_id" : "tdaYaSvXJueDq4oTN",
                    "value" : {
                            "strokes" : 15,
                            "par" : 14,
                            "putts" : 7,
                            "teesPlayed" : 3,
                            "fairway" : {
                                    "straight" : 1,
                                    "right" : 2
                            },
                            "shotType" : {
                                    "bunker" : 1
                            }
                    }
            }

The "out" options for mapReduce are either the "inline" as shown here where you can fit all the result in memory ( and within the 16MB BSON limit ), or alternately to another collection from which you can read later. There is a similar $out for .aggregate(), but this is generally negated by aggregation output being available as a "cursor", unless of course you really want it in a collection instead.


Concluding

So it all depends on how you really want to approach this. If speed is of the upmost importance then .aggregate() is generally going to yield the fastest results. On the other hand if you want to work "dynamically" with the produced "keys" then mapReduce allows the logic to be generally self contained, without the need for another inspection pass to generate the required aggregation pipeline statement.

like image 185
Blakes Seven Avatar answered Mar 13 '23 01:03

Blakes Seven