Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

mongodb group values by multiple fields

People also ask

How do I group multiple fields in MongoDB?

One of the most efficient ways of grouping the various fields present inside the documents of MongoDB is by using the $group operator, which helps in performing multiple other aggregation functions as well on the grouped data.

How do I group values in MongoDB?

Use the _id field in the $group pipeline stage to set the group key. See below for usage examples. In the $group stage output, the _id field is set to the group key for that document. The output documents can also contain additional fields that are set using accumulator expressions.

How do I display multiple fields in MongoDB?

In MongoDB, When you want to perform any operation on multiple fields then you have to use $group aggregation. You will more understand with help of examples. Example: In the example, I will show you how you can display some particular documents with multiple fields when we have a large dataset in the collection.

Can we do group by in MongoDB?

MongoDB group by is used to group data from the collection, we can achieve group by clause using aggregate function and group method in MongoDB. While using aggregate function with group by clause query operations is faster as normal query, basically aggregate function is used in multiple condition.


TLDR Summary

In modern MongoDB releases you can brute force this with $slice just off the basic aggregation result. For "large" results, run parallel queries instead for each grouping ( a demonstration listing is at the end of the answer ), or wait for SERVER-9377 to resolve, which would allow a "limit" to the number of items to $push to an array.

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$project": {
        "books": { "$slice": [ "$books", 2 ] },
        "count": 1
    }}
])

MongoDB 3.6 Preview

Still not resolving SERVER-9377, but in this release $lookup allows a new "non-correlated" option which takes an "pipeline" expression as an argument instead of the "localFields" and "foreignFields" options. This then allows a "self-join" with another pipeline expression, in which we can apply $limit in order to return the "top-n" results.

db.books.aggregate([
  { "$group": {
    "_id": "$addr",
    "count": { "$sum": 1 }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 2 },
  { "$lookup": {
    "from": "books",
    "let": {
      "addr": "$_id"
    },
    "pipeline": [
      { "$match": { 
        "$expr": { "$eq": [ "$addr", "$$addr"] }
      }},
      { "$group": {
        "_id": "$book",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1  } },
      { "$limit": 2 }
    ],
    "as": "books"
  }}
])

The other addition here is of course the ability to interpolate the variable through $expr using $match to select the matching items in the "join", but the general premise is a "pipeline within a pipeline" where the inner content can be filtered by matches from the parent. Since they are both "pipelines" themselves we can $limit each result separately.

This would be the next best option to running parallel queries, and actually would be better if the $match were allowed and able to use an index in the "sub-pipeline" processing. So which is does not use the "limit to $push" as the referenced issue asks, it actually delivers something that should work better.


Original Content

You seem have stumbled upon the top "N" problem. In a way your problem is fairly easy to solve though not with the exact limiting that you ask for:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 }
])

Now that will give you a result like this:

{
    "result" : [
            {
                    "_id" : "address1",
                    "books" : [
                            {
                                    "book" : "book4",
                                    "count" : 1
                            },
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 3
                            }
                    ],
                    "count" : 5
            },
            {
                    "_id" : "address2",
                    "books" : [
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 2
                            }
                    ],
                    "count" : 3
            }
    ],
    "ok" : 1
}

So this differs from what you are asking in that, while we do get the top results for the address values the underlying "books" selection is not limited to only a required amount of results.

This turns out to be very difficult to do, but it can be done though the complexity just increases with the number of items you need to match. To keep it simple we can keep this at 2 matches at most:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$unwind": "$books" },
    { "$sort": { "count": 1, "books.count": -1 } },
    { "$group": {
        "_id": "$_id",
        "books": { "$push": "$books" },
        "count": { "$first": "$count" }
    }},
    { "$project": {
        "_id": {
            "_id": "$_id",
            "books": "$books",
            "count": "$count"
        },
        "newBooks": "$books"
    }},
    { "$unwind": "$newBooks" },
    { "$group": {
      "_id": "$_id",
      "num1": { "$first": "$newBooks" }
    }},
    { "$project": {
        "_id": "$_id",
        "newBooks": "$_id.books",
        "num1": 1
    }},
    { "$unwind": "$newBooks" },
    { "$project": {
        "_id": "$_id",
        "num1": 1,
        "newBooks": 1,
        "seen": { "$eq": [
            "$num1",
            "$newBooks"
        ]}
    }},
    { "$match": { "seen": false } },
    { "$group":{
        "_id": "$_id._id",
        "num1": { "$first": "$num1" },
        "num2": { "$first": "$newBooks" },
        "count": { "$first": "$_id.count" }
    }},
    { "$project": {
        "num1": 1,
        "num2": 1,
        "count": 1,
        "type": { "$cond": [ 1, [true,false],0 ] }
    }},
    { "$unwind": "$type" },
    { "$project": {
        "books": { "$cond": [
            "$type",
            "$num1",
            "$num2"
        ]},
        "count": 1
    }},
    { "$group": {
        "_id": "$_id",
        "count": { "$first": "$count" },
        "books": { "$push": "$books" }
    }},
    { "$sort": { "count": -1 } }
])

So that will actually give you the top 2 "books" from the top two "address" entries.

But for my money, stay with the first form and then simply "slice" the elements of the array that are returned to take the first "N" elements.


Demonstration Code

The demonstration code is appropriate for usage with current LTS versions of NodeJS from v8.x and v10.x releases. That's mostly for the async/await syntax, but there is nothing really within the general flow that has any such restriction, and adapts with little alteration to plain promises or even back to plain callback implementation.

index.js

const { MongoClient } = require('mongodb');
const fs = require('mz/fs');

const uri = 'mongodb://localhost:27017';

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const client = await MongoClient.connect(uri);

    const db = client.db('bookDemo');
    const books = db.collection('books');

    let { version } = await db.command({ buildInfo: 1 });
    version = parseFloat(version.match(new RegExp(/(?:(?!-).)*/))[0]);

    // Clear and load books
    await books.deleteMany({});

    await books.insertMany(
      (await fs.readFile('books.json'))
        .toString()
        .replace(/\n$/,"")
        .split("\n")
        .map(JSON.parse)
    );

    if ( version >= 3.6 ) {

    // Non-correlated pipeline with limits
      let result = await books.aggregate([
        { "$group": {
          "_id": "$addr",
          "count": { "$sum": 1 }
        }},
        { "$sort": { "count": -1 } },
        { "$limit": 2 },
        { "$lookup": {
          "from": "books",
          "as": "books",
          "let": { "addr": "$_id" },
          "pipeline": [
            { "$match": {
              "$expr": { "$eq": [ "$addr", "$$addr" ] }
            }},
            { "$group": {
              "_id": "$book",
              "count": { "$sum": 1 },
            }},
            { "$sort": { "count": -1 } },
            { "$limit": 2 }
          ]
        }}
      ]).toArray();

      log({ result });
    }

    // Serial result procesing with parallel fetch

    // First get top addr items
    let topaddr = await books.aggregate([
      { "$group": {
        "_id": "$addr",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1 } },
      { "$limit": 2 }
    ]).toArray();

    // Run parallel top books for each addr
    let topbooks = await Promise.all(
      topaddr.map(({ _id: addr }) =>
        books.aggregate([
          { "$match": { addr } },
          { "$group": {
            "_id": "$book",
            "count": { "$sum": 1 }
          }},
          { "$sort": { "count": -1 } },
          { "$limit": 2 }
        ]).toArray()
      )
    );

    // Merge output
    topaddr = topaddr.map((d,i) => ({ ...d, books: topbooks[i] }));
    log({ topaddr });

    client.close();

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()

books.json

{ "addr": "address1",  "book": "book1"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book5"  }
{ "addr": "address3",  "book": "book9"  }
{ "addr": "address2",  "book": "book5"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book1"  }
{ "addr": "address15", "book": "book1"  }
{ "addr": "address9",  "book": "book99" }
{ "addr": "address90", "book": "book33" }
{ "addr": "address4",  "book": "book3"  }
{ "addr": "address5",  "book": "book1"  }
{ "addr": "address77", "book": "book11" }
{ "addr": "address1",  "book": "book1"  }

Using aggregate function like below :

[
{$group: {_id : {book : '$book',address:'$addr'}, total:{$sum :1}}},
{$project : {book : '$_id.book', address : '$_id.address', total : '$total', _id : 0}}
]

it will give you result like following :

        {
            "total" : 1,
            "book" : "book33",
            "address" : "address90"
        }, 
        {
            "total" : 1,
            "book" : "book5",
            "address" : "address1"
        }, 
        {
            "total" : 1,
            "book" : "book99",
            "address" : "address9"
        }, 
        {
            "total" : 1,
            "book" : "book1",
            "address" : "address5"
        }, 
        {
            "total" : 1,
            "book" : "book5",
            "address" : "address2"
        }, 
        {
            "total" : 1,
            "book" : "book3",
            "address" : "address4"
        }, 
        {
            "total" : 1,
            "book" : "book11",
            "address" : "address77"
        }, 
        {
            "total" : 1,
            "book" : "book9",
            "address" : "address3"
        }, 
        {
            "total" : 1,
            "book" : "book1",
            "address" : "address15"
        }, 
        {
            "total" : 2,
            "book" : "book1",
            "address" : "address2"
        }, 
        {
            "total" : 3,
            "book" : "book1",
            "address" : "address1"
        }

I didn't quite get your expected result format, so feel free to modify this to one you need.


Below query will provide exactly the same result as given in the desired response:

db.books.aggregate([
    {
        $group: {
            _id: { addresses: "$addr", books: "$book" },
            num: { $sum :1 }
        }
    },
    {
        $group: {
            _id: "$_id.addresses",
            bookCounts: { $push: { bookName: "$_id.books",count: "$num" } }
        }
    },
    {
        $project: {
            _id: 1,
            bookCounts:1,
            "totalBookAtAddress": {
                "$sum": "$bookCounts.count"
            }
        }
    }

]) 

The response will be looking like below:

/* 1 */
{
    "_id" : "address4",
    "bookCounts" : [
        {
            "bookName" : "book3",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 2 */
{
    "_id" : "address90",
    "bookCounts" : [
        {
            "bookName" : "book33",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 3 */
{
    "_id" : "address15",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 4 */
{
    "_id" : "address3",
    "bookCounts" : [
        {
            "bookName" : "book9",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 5 */
{
    "_id" : "address5",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 6 */
{
    "_id" : "address1",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 3
        },
        {
            "bookName" : "book5",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 4
},

/* 7 */
{
    "_id" : "address2",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 2
        },
        {
            "bookName" : "book5",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 3
},

/* 8 */
{
    "_id" : "address77",
    "bookCounts" : [
        {
            "bookName" : "book11",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 9 */
{
    "_id" : "address9",
    "bookCounts" : [
        {
            "bookName" : "book99",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
}