 

How to get (or aggregate) distinct keys of array in MongoDB

I'm trying to get MongoDB to aggregate over an array of objects with different key-value pairs, without knowing the keys in advance. (Just a simple sum would be OK.)

Example docs:

{data: [{a: 3}, {b: 7}]}
{data: [{a: 5}, {c: 12}, {f: 25}]}
{data: [{f: 1}]}
{data: []}

So basically each doc (or its array, really) can have zero or more entries, and I don't know the keys of those objects, but I want to sum and average the values over those keys.

Right now I'm just loading a bunch of docs and doing it myself in Node, but I'd like to offload that work to MongoDB.

I know I can unwind the arrays first, but how do I proceed from there? How do I sum/avg/min/max the values if I don't know the keys?

Zlatko asked Jun 27 '15 02:06





1 Answer

If you do not know the keys and cannot make a reasonable educated guess, then you are basically stuck as far as the aggregation framework goes. You could supply "all of the keys" for consideration, but I suspect your actual data looks more like this:

{ "data": [{ "film": 10 }, { "television": 5 }, { "boardGames": 1 }] }

So there would be little point here in finding out all the "key names" and then throwing them at an aggregation statement.

For the record though, "this is why you do not structure your data storage like this". Information like "film" here should not be used as a "key" name, because it is useful "data" that could be searched upon and most importantly "indexed" in a database system.

So your data should really look like this:

{ 
    "data": [
        { "type": "film", "value": 10 },
        { "type": "television", "value": 5 },
        { "type": "boardGames", "value": 1 }
    ]
}
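
If existing documents need to be migrated into that shape, a minimal sketch of the conversion in plain JavaScript might look like the following. The helper name normalizeEntries is hypothetical and not part of MongoDB; the sketch assumes every entry in "data" is a flat object of key-value pairs, as in the question.

```javascript
// Hypothetical helper (not a MongoDB API): converts dynamic-key entries such as
// [{ a: 5 }, { c: 12 }] into [{ type: "a", value: 5 }, { type: "c", value: 12 }].
function normalizeEntries(entries) {
  const result = [];
  for (const entry of entries) {
    // An entry may hold one or several keys; each becomes its own record.
    for (const [type, value] of Object.entries(entry)) {
      result.push({ type, value });
    }
  }
  return result;
}

// Example using one of the documents from the question.
const doc = { data: [{ a: 5 }, { c: 12 }, { f: 25 }] };
const migrated = { data: normalizeEntries(doc.data) };
console.log(JSON.stringify(migrated));
```

The converted document could then be written back with an ordinary update, so that the key names become searchable and indexable data.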

Then the aggregation statement is simple, as are many other things:

db.collection.aggregate([
    { "$unwind": "$data" },
    { "$group": {
        "_id": null,
        "sum": { "$sum": "$data.value" },
        "avg": { "$avg": "$data.value" }
    }}
])
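
To make the two stages concrete, here is a rough plain-JavaScript equivalent of what $unwind followed by $group with "_id": null computes. It is only an illustration run on made-up sample documents in the restructured shape, not how the server implements the pipeline.

```javascript
// Made-up sample documents in the restructured { type, value } shape.
const docs = [
  { data: [{ type: "film", value: 10 }, { type: "television", value: 5 }] },
  { data: [{ type: "boardGames", value: 1 }] },
  { data: [] }
];

// $unwind: one output record per array element; an empty array yields none.
const unwound = docs.flatMap(d => d.data.map(item => ({ data: item })));

// $group with _id: null: a single accumulator over every unwound record.
const sum = unwound.reduce((acc, r) => acc + r.data.value, 0);
const avg = unwound.length > 0 ? sum / unwound.length : null;
const grouped = { _id: null, sum, avg };
console.log(grouped);
```

Note that the document with the empty array contributes nothing, which matches the default behaviour of $unwind.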

But since the key names vary from document to document and do not have a uniform structure, you need JavaScript processing on the server to traverse the keys, and that means mapReduce:

db.collection.mapReduce(
    function() {
        this.data.forEach(function(data) {
            Object.keys(data).forEach(function(key) {
                emit(null,data[key]); // emit the value regardless of key name
            });
        });
    },
    function(key,values) {
        return Array.sum(values);     // Just summing for example
    },
    { "out": { "inline": 1 } }
)
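
The question also asks for avg/min/max. An average cannot simply be reduced like a sum, because the reducer may be called several times on partial results, so one common approach is to emit a small record of { sum, count, min, max } and derive the average in a finalize step. The sketch below shows that logic as plain JavaScript with a tiny local driver so it can be run without a server; the function names are hypothetical. In an actual mapReduce call the map function would read "this" and call the global emit, and the last function would be passed through the "finalize" option.

```javascript
// Map step: emit one { sum, count, min, max } record per value, whatever its key.
function mapDoc(doc, emit) {
  doc.data.forEach(function (entry) {
    Object.keys(entry).forEach(function (key) {
      const v = entry[key];
      emit(null, { sum: v, count: 1, min: v, max: v });
    });
  });
}

// Reduce step: merge partial records. The output has the same shape as the
// input, so re-reducing already merged results stays correct.
function reduceValues(key, values) {
  return values.reduce(function (acc, v) {
    return {
      sum: acc.sum + v.sum,
      count: acc.count + v.count,
      min: Math.min(acc.min, v.min),
      max: Math.max(acc.max, v.max)
    };
  });
}

// Finalize step: compute the average once everything has been merged.
function finalizeStats(key, reduced) {
  return Object.assign({}, reduced, { avg: reduced.sum / reduced.count });
}

// Local driver for illustration only, using the documents from the question.
const docs = [
  { data: [{ a: 3 }, { b: 7 }] },
  { data: [{ a: 5 }, { c: 12 }, { f: 25 }] },
  { data: [{ f: 1 }] },
  { data: [] }
];
const emitted = [];
docs.forEach(function (doc) {
  mapDoc(doc, function (key, value) { emitted.push(value); });
});
const stats = finalizeStats(null, reduceValues(null, emitted));
console.log(stats);
```

Emitting the key instead of null in the map step would give the same statistics per key name rather than over all values.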

And of course the JavaScript execution here will run much more slowly than the natively coded operators available to the aggregation framework.

So this should be an object lesson in why you do not use "data" as "key names" when storing data in a database. The aggregation framework works with standard structures and is fast. Falling back to JavaScript processing is more flexible, but you pay for it mostly in speed and in lost features.

Blakes Seven answered Oct 10 '22 19:10
