I have a query:
db.test.aggregate([ { $group: { _id: "$key", frequency: { $sum: 1 } } } ])
This gets the frequency of each distinct value of key in the test collection. Basically, I have computed the distribution of key.
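For example, given documents like { key: "a" }, { key: "a" }, { key: "b" }, the result would be along the lines of:

{ "_id" : "a", "frequency" : 2 }
{ "_id" : "b", "frequency" : 1 }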
Now imagine I want to get the distributions of key1, key2, and key3 (so three different distributions).
Obviously, I could run this query three times, once for each key, but it seems like it should be possible to optimize the query by having it count all three keys at the same time. I have been playing around with it and searching the whole of the inter-webs, but so far I am resigned to running three separate aggregation queries or using a map/reduce function.
Does anyone have any other ideas?
There are a few different approaches you could use here:
Use map/reduce: don't do this. Right now it is much faster to run the aggregation framework three times than to use a map/reduce function for this use case.
Run the aggregation three times. This is not optimal, but if you don't have time constraints then it is the easiest option. If your aggregations take less than a few seconds anyway, I wouldn't worry about optimizing until they become a problem.
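A minimal sketch of that second option in the shell (assuming the collection is named test and the fields are key1, key2, and key3):

["key1", "key2", "key3"].forEach(function (k) {
    // One $group pass per field; "$" + k builds the field path, e.g. "$key1".
    printjson(db.test.aggregate([
        { $group: { _id: "$" + k, frequency: { $sum: 1 } } }
    ]).toArray());
});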
Here's the best work-around I can think of. The $group operator allows you to build an _id on multiple fields, e.g. {"_id": {"a": "$key1", "b": "$key2", "c": "$key3"}}. Doing this creates a grouping for every existing combination of your different keys. You could group your keys this way and then manually sum across the results in the client.
Let me elaborate. Let's say we have a collection of shapes, where each shape has a color (f1), a size (f2), and a kind (f3: square, circle, etc.). An aggregation on a multi-key _id could look like:
db.shapes.aggregate([ { $group: { _id: { "f1": "$f1", "f2": "$f2", "f3": "$f3" }, count: { "$sum": 1 } } } ])
and return:
"result" : [
{
"_id" : {
"f1" : "yellow",
"f2" : "medium",
"f3" : "triangle"
},
"count" : 4086
},
{
"_id" : {
"f1" : "red",
"f2" : "small",
"f3" : "triangle"
},
"count" : 4138
},
{
"_id" : {
"f1" : "red",
"f2" : "big",
"f3" : "square"
},
"count" : 4113
},
{
"_id" : {
"f1" : "yellow",
"f2" : "small",
"f3" : "triangle"
},
"count" : 4145
},
{
"_id" : {
"f1" : "red",
"f2" : "small",
"f3" : "square"
},
"count" : 4062
}
... and so on
You would then sum up the results client-side, over a drastically reduced number of entries. Assuming the number of unique values for each key is sufficiently small compared to the total number of documents, you could do this final step in a negligible amount of time.
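As a rough sketch of that client-side pass (in the shell, using the shapes collection and field names from the example above):

// Run the multi-key grouping and collect the result documents.
var results = db.shapes.aggregate([
    { $group: { _id: { "f1": "$f1", "f2": "$f2", "f3": "$f3" }, count: { "$sum": 1 } } }
]).toArray();

// Fold the combination counts back into one distribution per field.
var distributions = { f1: {}, f2: {}, f3: {} };
results.forEach(function (doc) {
    ["f1", "f2", "f3"].forEach(function (k) {
        var value = doc._id[k];
        distributions[k][value] = (distributions[k][value] || 0) + doc.count;
    });
});
printjson(distributions);

Each combination's count contributes to the running total for each of its three values, so distributions.f1, distributions.f2, and distributions.f3 end up holding the same frequencies that three separate $group queries would have produced.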