I have a MongoDB database of about 400 GB. The documents contain a variety of fields, but the important one here is an array of IDs.
So a document might look like this:
{
    "name": "bob",
    "dob": "1/1/2011",
    "key": [
        "1020123123",
        "1234123222",
        "5021297723"
    ]
}
The focal variable here is "key". There are about 10 billion keys in total across 50 million documents (so each document has about 200 keys on average). Keys can repeat, and there are about 15 million UNIQUE keys.
What I would like to do is return the 10,000 most common keys. I thought aggregate might do this, but I'm having a lot of trouble getting it to run. Here is my code:
db.users.aggregate(
    [
        { $unwind : "$key" },
        { $group : { _id : "$key", number : { $sum : 1 } } },
        { $sort : { number : -1 } },
        { $limit : 10000 }
    ]
);
Any ideas what I'm doing wrong?
On large collections of millions of documents, MongoDB's aggregation has been shown to perform much worse than Elasticsearch, and performance degrades further with collection size once MongoDB starts spilling to disk because system RAM is limited; a $lookup stage used without supporting indexes can also be very slow (see "Aggregation is slow" in the Working with Data category of the MongoDB Developer Community Forums). More relevant here: each blocking stage such as $group and $sort is limited to 100 MB of RAM, and grouping 15 million unique keys will exceed that unless allowDiskUse is enabled so the stage can spill to temporary files.
That said, the aggregation pipeline is still the preferred way to aggregate data in MongoDB: it runs as native operations inside the server and can operate on a sharded collection.
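For what it's worth, if the server is on MongoDB 3.4 or newer, the $unwind / $group / $sort counting pattern can be collapsed into the built-in $sortByCount stage. This is only an equivalent sketch of the same query (the count ends up in a field called count rather than number), not a different fix:
db.users.aggregate(
    [
        { $unwind : "$key" },
        { $sortByCount : "$key" },   // shorthand for $group on "$key" with a count, then $sort descending
        { $limit : 10000 }
    ],
    { allowDiskUse : true }
);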
Try this:
db.users.aggregate(
    [
        { $unwind : "$key" },
        { $group : { _id : "$key", number : { $sum : 1 } } },
        { $sort : { number : -1 } },
        { $limit : 10000 },
        { $out : "result" }
    ],
    {
        allowDiskUse : true,
        cursor : {}
    }
);
Then read the results back with db.result.find().
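To spot-check the output (assuming the $out target collection is named result, as above), sort on the number field explicitly rather than relying on insertion order:
db.result.find().sort({ number : -1 }).limit(10)   // top 10 keys and their counts
db.result.count()                                  // should be 10000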