MongoDB Ordering by average combined numbers or nested sub arrays

I'm having some trouble working out the best way to do this in MongoDB. Arguably it's a relational data set, so I will probably be slated. Still, it's a challenge to see if it's possible.

I currently need to order logistics managers by the daily average miles across the vans in their department and, in a separate list, by a combined weekly average.

My first setup in the database was as follows:

{
   "_id" : ObjectId("555cf04fa3ed8cc2347b23d7"),
   "name" : "My Manager 1",
   "vans" : [ 
       {
           "name" : "van1",
           "miles" : NumberLong(56)
       },
       {
           "name" : "van2",
           "miles" : NumberLong(34)
       }
   ]
}

But I can't see how to order by a nested array value without knowing the parent array keys (these will be standard 0-x)

So my next choice was to scrap that idea and keep just the name in the first collection, with the vans in a second collection that stores the id of the manager.

So removing vans from the above example and adding this collection (vans)

{
   "_id" : ObjectId("555cf04fa3ed8cc2347b23d9"),
   "name" : "van1",
   "miles" : NumberLong(56),
   "manager_id" : "555cf04fa3ed8cc2347b23d7"
}

But because I need to show the results by manager, how do I order (in a query, if possible) by the average miles in this collection where id=x and then display the manager by his id?

Thanks for your help

deejuk asked May 21 '15 07:05


2 Answers

If each manager is going to have a limited number of vans, then your first approach is better, as you do not have to make two separate queries to the database to collect your information.

Then comes the question of how to calculate the average mileage per manager, which is where the Aggregation Framework will help you a lot. Here is a query that will get you the desired data:

db.manager.aggregate([
                        {$unwind: "$vans"},
                        {$group:
                                  {_id: 
                                         {
                                            _id: "$_id", 
                                            name: "$name"
                                         }, 
                                   avg_milage: {$avg: "$vans.miles"}
                                  }
                        },
                        {$sort: {"avg_milage": -1}},
                        {$project: 
                                    {_id: "$_id._id", 
                                     name: "$_id.name", 
                                     avg_milage: "$avg_milage"
                                    }
                        }
                    ])

The first $unwind step simply unwraps the vans array and creates a separate document for each element of the array.

Then the $group stage collects all documents with the same (_id, name) pair and, in the avg_milage field, computes the average of the miles field across those documents.

The $sort stage is obvious, it just sorts the documents in the descending order, using the new avg_milage field as the sort key.

And finally, the last $project step just cleans up the documents by making appropriate projections, just for beauty :)
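To make the stages concrete, here is a rough plain-JavaScript sketch of what each one does to the data. It is not how MongoDB implements them, only an in-memory imitation; the second manager and the string ids `m1`/`m2` are made up for illustration.

```javascript
// Sample documents shaped like the question's first data model.
// "My Manager 2", its vans, and the ids m1/m2 are made-up values.
const managers = [
  { _id: 'm1', name: 'My Manager 1',
    vans: [{ name: 'van1', miles: 56 }, { name: 'van2', miles: 34 }] },
  { _id: 'm2', name: 'My Manager 2',
    vans: [{ name: 'van3', miles: 80 }, { name: 'van4', miles: 70 }] }
];

// $unwind: one output document per element of the vans array.
const unwound = managers.flatMap(m =>
  m.vans.map(van => ({ _id: m._id, name: m.name, vans: van }))
);

// $group with $avg: accumulate miles per manager, then divide by the count.
const groups = new Map();
for (const doc of unwound) {
  if (!groups.has(doc._id)) {
    groups.set(doc._id, { _id: doc._id, name: doc.name, sum: 0, count: 0 });
  }
  const g = groups.get(doc._id);
  g.sum += doc.vans.miles;
  g.count += 1;
}

// $project and $sort: flatten each group and order by average, descending.
const result = [...groups.values()]
  .map(g => ({ _id: g._id, name: g.name, avg_milage: g.sum / g.count }))
  .sort((a, b) => b.avg_milage - a.avg_milage);

console.log(result);
```

The point is that after $unwind there is one document per van, so nothing in the later stages needs to know the array positions.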

A similar thing is needed for your second desired result:

db.manager.aggregate([
                        {$unwind: "$vans"},
                        {$group:
                                  {_id: 
                                         {
                                            _id: "$_id", 
                                            name: "$name"
                                         }, 
                                   total_milage: {$sum: "$vans.miles"}
                                  }
                        },
                        {$sort: {"total_milage": -1}},
                        {$project: 
                                    {_id: "$_id._id", 
                                     name: "$_id.name", 
                                     weekly_milage: {
                                                       $multiply: [
                                                                     "$total_milage", 
                                                                     7
                                                                  ]
                                                    }

                                    }
                        }
                    ])

This will produce the list of managers with their weekly mileage, sorted in descending order. Multiplying by 7 assumes each miles value is a daily figure. You can then $limit the result to get, for instance, the manager with the highest mileage.
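As a sketch of the $limit idea, the weekly query can be written as a plain array with a `{ $limit: 1 }` stage placed after the sort. The variable name `topWeeklyPipeline` is only an illustration; the stages themselves are the same as in the query above.

```javascript
// Same stages as the weekly query above, with a $limit stage appended
// after $sort so only the highest-ranked manager is kept.
const topWeeklyPipeline = [
  { $unwind: '$vans' },
  { $group: {
      _id: { _id: '$_id', name: '$name' },
      total_milage: { $sum: '$vans.miles' }
  } },
  { $sort: { total_milage: -1 } },
  { $limit: 1 },
  { $project: {
      _id: '$_id._id',
      name: '$_id.name',
      weekly_milage: { $multiply: ['$total_milage', 7] }
  } }
];

// In the mongo shell this would be run as:
//   db.manager.aggregate(topWeeklyPipeline)
console.log(JSON.stringify(topWeeklyPipeline, null, 2));
```

The $limit has to come after the $sort, otherwise it would cut off an arbitrary document rather than the top one.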

And in much the same way, you can grab info for your vans:

db.manager.aggregate([
                        {$unwind: "$vans"},
                        {$group:
                                  {_id: "$vans.name", 
                                   total_milage: {$sum: "$vans.miles"}
                                  }
                        },
                        {$sort: {"total_milage": -1}},
                        {$project: 
                                    {van_name: "$_id", 
                                     weekly_milage: {
                                                       $multiply: [
                                                                     "$total_milage", 
                                                                     7
                                                                  ]
                                                    }

                                    }
                        }
                    ])
bagrat answered Oct 24 '22 17:10


First, do you require average miles for a single day, average miles over a given time period, or average miles over the life of the manager? I would consider adding a timestamp field. Yes, _id has a timestamp, but this only reflects the time the document was created, not necessarily the time of the initial day's log.
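To illustrate the point about _id: the first 4 bytes of an ObjectId encode its creation time in seconds since the Unix epoch, which is what ObjectId.getTimestamp reads in the shell. A minimal plain-JavaScript sketch (the helper name `objectIdCreationDate` is my own, not a MongoDB API):

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId hold the creation
// time as seconds since the Unix epoch.
function objectIdCreationDate(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// The _id from the question's sample manager document.
const created = objectIdCreationDate('555cf04fa3ed8cc2347b23d7');
console.log(created.toISOString());
```

This tells you when the document was inserted, which is why a separate timestamp field is safer if a day's log can be entered late.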

Considerations for the first data model:

  • Does each document represent one day, or one manager?
  • How many "vans" do you expect to have in the array? Does this list grow over time? Do you need to consider the 16MB max doc size in a year or two from now?

Considerations for the second data model:

  • Can you store the manager's name as the "manager_id" field? Can this be used as a possible unique ID for a secondary meta lookup? Doing so would limit the necessity of a secondary manager meta-data lookup just to get their name.

As @n9code has pointed out, the aggregation framework is the answer in both cases.

For the first data model, assuming each document represents one day and you want to retrieve an average for a given day or a range of days:

db.collection.aggregate([
    { $match: {
        name: 'My Manager 1',
        timestamp: { $gte: ISODate(...), $lt: ISODate(...) }
    } },
    { $unwind: '$vans' },
    { $group: {
        _id: {
            _id: '$_id',
            name: '$name',
            timestamp: '$timestamp'
        },
        avg_mileage: {
            // after $unwind, miles lives under the vans sub-document
            $avg: '$vans.miles'
        }
    } },
    { $sort: {
        avg_mileage: -1
    } },
    { $project: {
        _id: '$_id._id',
        name: '$_id.name',
        timestamp: '$_id.timestamp',
        avg_mileage: 1
    } }
]);

If, for the first data model, each document represents a manager and the "vans" array grows daily, this particular data model is not ideal for two reasons:

  • "vans" array may grow beyond max document size... eventually, although that would be a lot of data
  • It is more difficult and memory intensive to limit a certain date range since timestamp at this point would be nested within an item of "vans" and not in the root of the document

For the sake of completeness, here is the query:

/*
Assuming data model is:
{
    _id: ...,
    name: ...,
    vans: [
        { name: ..., miles: ..., timestamp: ... }
    ]
}
*/

db.collection.aggregate([
    { $match: {
        name: 'My Manager 1'
    } },
    { $unwind: '$vans' },
    { $match: {
        'vans.timestamp': { $gte: ISODate(...), $lt: ISODate(...) }
    } },
    { $group: {
        _id: {
            _id: '$_id',
            name: '$name'
        },
        avg_mileage: {
            // after $unwind, miles lives under the vans sub-document
            $avg: '$vans.miles'
        }
    } },
    { $sort: {
        avg_mileage: -1
    } },
    { $project: {
        _id: '$_id._id',
        name: '$_id.name',
        avg_mileage: 1
    } }
]);

For the second data model, aggregation is more straightforward. I'm assuming the inclusion of a timestamp:

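db.collection.aggregate([
    { $match: {
        // use the plain string instead if manager_id is stored as a string,
        // as it is in the question's sample document
        manager_id: ObjectId('555cf04fa3ed8cc2347b23d7'),
        timestamp: { $gte: ISODate(...), $lt: ISODate(...) }
    } },
    { $group: {
        _id: '$manager_id',
        avg_mileage: {
            $avg: '$miles'
        },
        // collect the distinct van names that went into the average
        names: {
            $addToSet: '$name'
        }
    } },
    { $sort: {
        avg_mileage: -1
    } },
    { $project: {
        manager_id: '$_id',
        avg_mileage: 1,
        names: 1
    } }
]);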

I have added an array of names (vehicles?) used during the average computation.
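To rank every manager rather than a single one, the manager_id condition can be dropped from $match. The grouped result then only carries ids, so the names still have to be attached from the managers collection. Here is a rough plain-JavaScript sketch of that second step done in application code; all sample values and the ids `a1`/`b2` are made up, and this is only one possible way to do it.

```javascript
// Made-up sample data shaped like the question's second data model.
const managerDocs = [
  { _id: 'a1', name: 'My Manager 1' },
  { _id: 'b2', name: 'My Manager 2' }
];
const vanDocs = [
  { name: 'van1', miles: 56, manager_id: 'a1' },
  { name: 'van2', miles: 34, manager_id: 'a1' },
  { name: 'van3', miles: 80, manager_id: 'b2' }
];

// In-memory imitation of:
//   { $group: { _id: '$manager_id', avg_mileage: { $avg: '$miles' } } }
const totals = {};
for (const van of vanDocs) {
  const t = totals[van.manager_id] || (totals[van.manager_id] = { sum: 0, count: 0 });
  t.sum += van.miles;
  t.count += 1;
}

// Attach each manager's name by _id, then sort by average, descending.
const ranked = managerDocs
  .filter(m => totals[m._id])
  .map(m => ({
    manager_id: m._id,
    name: m.name,
    avg_mileage: totals[m._id].sum / totals[m._id].count
  }))
  .sort((a, b) => b.avg_mileage - a.avg_mileage);

console.log(ranked);
```

Storing the manager's name on each van document, as suggested earlier, would avoid this extra step.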

Relevant documentation:

  • $match, $unwind, $group, $sort, $project - Pipeline Aggregation Stages
  • $avg, $addToSet - Group Accumulator Operators
  • Date types
  • ObjectId.getTimestamp
zamnuts answered Oct 24 '22 16:10