Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB 2.2 Aggregation Framework group by field name

Tags:

mongodb

Is it possible to group-by field name? Or do I need a different structure so I can group-by value?

I know we can use group by on values and we can unwind arrays, but is it possible to get total apples, pears and oranges owned by John amongst the three houses here without specifying "apples", "pears" and "oranges" explicitly as part of the query? (so NOT like this);

// total all the fruit John has at each house
db.houses.aggregate([
    {
        $group: {
            _id: null,

            "apples":  { $sum: "$people.John.items.apples" },
            "pears":   { $sum: "$people.John.items.pears" }, 
            "oranges": { $sum: "$people.John.items.oranges" }, 
        }
    },
])

In other words, can I group-by the first field-name under "items" and get the aggregate sum of apples:104, pears:202 and oranges:306, but also bananas, melons and anything else that might be there? Or do I need to restructure the data into an array of key/value pairs like categories?

db.createCollection("houses");
db.houses.remove();
db.houses.insert(
[
    {
        House: "birmingham",
        categories : [
            {
                k : "location",
                v : { d : "central" }
            }
        ],
        people: {
            John: {
                items: {
                    apples: 2,
                    pears: 1,
                    oranges: 3,
                }
            },
            Dave: {
                items: {
                    apples: 30,
                    pears: 20,
                    oranges: 10,
                },
            },
        },
    },
    {
        House: "London", categories: [{ k: "location", v: { d: "central" } }, { k: "type", v: { d: "rented" } }],
        people: {
            John: { items: { apples: 2, pears: 1, oranges: 3, } },
            Dave: { items: { apples: 30, pears: 20, oranges: 10, }, },
        },
    },
    {
        House: "Cambridge", categories: [{ k: "type", v: { d: "rented" } }],
        people: {
            John: { items: { apples: 100, pears: 200, oranges: 300, } },
            Dave: { items: { apples: 0.3, pears: 0.2, oranges: 0.1, }, },
        },
    },
]
);

Secondly, and more importantly, could I then also group by "house.categories.k" ? In other words, is it possible to find out how many "apples" "John" has in "rented" vs "owned" or "friends" houses (so group by "categories.k.type")?

Finally - if this is even possible, is it sensible? At first I thought it was quite useful to create dictionaries of nested objects using actual field names of the object, as it seemed a logical use of a document database, and it seemed to make the MR queries easier to write vs arrays, but now I'm starting to wonder if this is all a bad idea and having variable field names makes it very tricky/inefficient to write aggregation queries.

like image 796
cirrus Avatar asked Mar 05 '26 20:03

cirrus


2 Answers

OK, so I think I have this partially solved. At least for the shape of data in the initial question.

// How many of each type of fruit does John have at each location
db.houses.aggregate([
    {
        $unwind: "$categories"
    },
    {
        $match: { "categories.k": "location" }
    },
    {
        $group: {
            _id: "$categories.v.d",
            "numberOf": { $sum: 1 },
            "Total Apples": { $sum: "$people.John.items.apples" },
            "Total  Pears": { $sum: "$people.John.items.pears" },
        }
    },
])

which yields;

{
        "result" : [
                {
                        "_id" : "central",
                        "numberOf" : 2,
                        "Total Apples" : 4,
                        "Total  Pears" : 2
                }
        ],
        "ok" : 1
}

Note that there's only "central", but if I had other "location"s in my DB I'd get a range of totals for each location. I wouldn't need the $unwind step if I had named properties instead of an array for "categories", but this is where I find the structure is at odds with itself. There are several keywords likely under "categories". The sample data shows "type" and "location" but there could be around 10 of these categorizations all with different values. So if I used named fields;

"categories": {
  location: "london",
  type: "owned",
}

...the problem I then have is indexing. I can't afford to simply index "location" since those are user-defined categories, and if 10,000 users choose 10,000 different ways of categorizing their houses I'd need 10,000 indexes, one for each field. But by making it an array I only need one on the array field itself. The downside is the $unwind step. I ran into this before with MapReduce. The last thing you want to be doing is a ForEach loop in JavaScript to cycle an array if you can help it. What you really want is to filter out the fields by name because it's much quicker.

Now this is all well and good where I already know what fruit I'm looking for, but if I don't, it's much harder. I can't (as far as I can see) $unwind or otherwise ForEach "people.John.items" here. If I could, I'd be overjoyed. So since the names of fruit are again user-defined, it looks like I need to convert them to an array as well, like this;

{
    "people" : {
        "John" : {
            "items" : [
                { k:"apples", v:100 },
                { k:"pears", v:200 },
                { k:"oranges", v:300 },
            ]
        },
    }
}

So that now allows me get the fruit (where I don't know which fruit to look for) totalled, again by location;

db.houses.aggregate([
    {
        $unwind: "$categories"
    },
    {
        $match: { "categories.k": "location" }
    },
    {
        $unwind: "$people.John.items" 
    },
    {
        $group: { // compound key - thanks to Jenna
            _id: { fruit:"$people.John.items.k", location:"$categories.v.v" },
            "numberOf": { $sum: 1 },
            "Total Fruit": { $sum: "$people.John.items.v" },
        }
    },
])

So now I'm doing TWO $unwinds. If you're thinking that looks grotesquely ineffecient, you'd be right. If I have just 10,000 house records, with 10 categories each, and 10 types of fruit, this query takes half a minute to run. OK, so I can see that moving the $match before the $unwind improves things significantly but then it's the wrong output. I don't want an entry for every category, I want to filter out just the "location" categories.

like image 157
cirrus Avatar answered Mar 07 '26 08:03

cirrus


I would have made this comment, but it's easier to format in a response text box.

{ _id: 1,
  house: "New York",
  people: {
      John: {
          items: {apples: 1, oranges:2}
      }
      Dave: {
          items: {apples: 2, oranges: 1}
      }
  }
}

{ _id: 2,
      house: "London",
      people: {
          John: {
              items: {apples: 3, oranges:2}
          }
          Dave: {
              items: {apples: 1, oranges:3}
          }
      }
}

Just to make sure I understand your question, is this what you're trying to accomplish?

{location: "New York", johnFruit:3}
{location: "London", johnFruit: 5}

Since categories is not nested under house, you can't group by "house.categories.k", but you can use a compound key for the _id of $group to get this result:

{ $group: _id: {house: "$House", category: "$categories.k"} 

Although "k" doesn't contain the information that you're presumably trying to group by. And as for "categories.k.type", type is the value of k, so you can't use this syntax. You would have to group by "categories.v.d".

It may be possible with your current schema to accomplish this aggregation using $unwind, $project, possibly $match, and finally $group, but the command won't be pretty. If possible, I would highly recommend restructuring your data to make this aggregation much simpler. If you would like some help with schema, please let us know.

like image 29
Jenna Avatar answered Mar 07 '26 09:03

Jenna



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!