Increasing memory limit for MongoDB aggregate jobs

Based on the aggregation pipeline limits docs, "if any single aggregation operation consumes more than 10 percent of system RAM, the operation will produce an error." - http://docs.mongodb.org/manual/core/aggregation-pipeline-limits/

Is there any way of increasing this limit? I have also set allowDiskUse: true (so the error is no longer an issue), but would like to use more RAM to improve performance.

Background: I am running a large aggregate job in MongoDB over about 100 million entries. It is basically a massive call to $group to merge the entries based on a key.
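Roughly, the job looks like the following (the collection and field names are simplified placeholders, not the real ones):

db.entries.aggregate([
    { "$group": {
        "_id": "$key",                        // the key the entries are merged on
        "merged": { "$addToSet": "$value" },  // collect/merge the per-entry values
        "count": { "$sum": 1 }
    } }
], { allowDiskUse: true })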

I am using the development release, MongoDB v2.6.0-rc2 (3/21/2014).

asked Mar 29 '14 by Jeff Tsui


2 Answers

There are 3 solutions I can think of:

Solution 1 (MongoDB): $sort by id. Use $group only to count the number of documents per unique id. Then use $facet with $limit and $skip to split the operation up into smaller batches (a rough sketch follows after solution 3). I haven't done that yet, but I am about to try, since I am experiencing the same problem as you.

Solution 2 (MongoDB): Depending on the queries of your use case and the number of distinct ids, you could create a collection for each id. It's ugly, but if each request only targets specific ids it might make sense. Especially with a low number of ids there might be no other workaround, since there is a hard limit on document size as well (16 MB, I think), so a large enough document array nested inside a document is not an option I suppose.

Solution 3: Write your own highly customized database solution. Allocate memory, ideally one large sequential block, and lay out all the data it can carry in struct arrays within that buffer, which can remain in memory at all times or be written back to a fast disk.

Perform most operations in memory, in millisecond time.

If the data is too large to keep in memory for an entire operation, you can map files to memory and use views of those mapped files. I'm using 2 small NVMe drives in another project which cost me 40€ each. Together they can write 15 GB to disk in less than 3 seconds, which is the size of the buffer I need for the entire dataset.
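To illustrate solution 1 from above, this is roughly the shape of the pipeline I intend to try (the collection name, field names and batch sizes are placeholders, not tested code):

db.entries.aggregate([
    { "$sort": { "id": 1 } },                                  // sort by the grouping key
    { "$group": { "_id": "$id", "count": { "$sum": 1 } } },    // only count documents per unique id
    { "$facet": {                                              // then split the work into batches
        "batch1": [ { "$skip": 0 },     { "$limit": 50000 } ],
        "batch2": [ { "$skip": 50000 }, { "$limit": 50000 } ]
    } }
], { allowDiskUse: true })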

I was convinced by a friend to give MongoDB a try in order to scrape wildly variable data, and I am currently handling a 200 MB collection which will eventually grow to 100 times that size. The operation I'm performing is simply grouping documents by id and pushing the roots into an array. allowDiskUse is set to true, but other than the wording of the error message this seems to make no difference.
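In other words, something along these lines (the names are placeholders):

db.scraped.aggregate([
    { "$group": {
        "_id": "$id",                     // group by the scraped id
        "docs": { "$push": "$$ROOT" }     // push the whole documents into an array
    } }
], { allowDiskUse: true })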

I really don't see the issue with letting users adjust the memory limit to their needs. Even if people say using 16 GB of memory is a bad idea, there are plenty of scenarios in which it definitely is not. As adults we should all be allowed to make our own mistakes. No operation on a 200 MB database should ever be an issue, especially not one as simple as the one I am performing - or failing to perform.

answered Oct 05 '22 by seafunk


Well, no, there is no such setting, and if you really think about it there is a good reason for this. If you first consider what aggregate is doing and what MongoDB does in general, it should become clear.

This is what "should" be at the "head" of any sensible aggregation pipeline:

db.collection.aggregate([
    { "$match": { /* Something here */ } },
    /* ...the rest of the pipeline... */
])

And these are the reasons:

  1. It makes good sense to try to reduce the working set that you are operating on in any operation.

  2. This is also the only time you get the opportunity to use an index to aid in searching the selection, which is always better than a collection scan.

  3. Even though there is a built-in "optimizer" that looks for such things as "projections" limiting the "selected" fields, the best way to keep the working set small is to only work on the valid records. Matches in later stages are not "optimized" in this way (see point 1).
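For example, with an index on the field being matched (the index and field names here are just illustrative), the "head" of the pipeline can restrict everything that follows:

db.collection.createIndex({ "created": 1 })     // illustrative index on the matched field

db.collection.aggregate([
    // an indexed $match at the "head" of the pipeline reduces the working set
    { "$match": { "created": { "$gte": ISODate("2014-01-01") } } },
    // later stages only ever see the matched records
    { "$group": { "_id": "$key", "count": { "$sum": 1 } } }
])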

The next thing to consider is the general behavior of MongoDB. What the server process wants to do is "consume" as much of the available machine memory as it can in order to hold the "working set" data (collections and/or indexes), so that it can "work" on that data in the most efficient way.

So it really is in the "best interests" of the database engine to "spend" most of its memory allocation in this way. That way, both your "aggregate" job and all of the other concurrent processes have access to the "working data" in the memory space.

It is therefore "not optimal" for MongoDB to "steal" this memory allocation away from the other concurrent operations just to service your running aggregation operation.

In "programming to hardware requirements" terms, you are aware that future releases allow the aggregation pipeline to use "disk use" in order to allow larger processing. You can always add SSDs or other fast storage technologies. And of course "10%" of RAM is relative to the amount of RAM that is installed in a system, so you can always increase that.

The upshot of this is that MongoDB's actual job is to be a "concurrent datastore", and it does that well. What it is not is a specialized "aggregation job-runner", and it should not be treated as such.

So either "break up" your workloads, or increase your hardware spec, or simply switch the large "task running" activity to something that does focus on the running job, such as a Hadoop-style "mapReduce", and leave MongoDB to its job of serving the data.

Or of course, change your design to simply "pre-aggregate" the required data somewhere "on write".
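For example, a simple "pre-aggregate on write" pattern keeps an upserted summary document per key and maintains it as each record is written (the names here are purely illustrative):

db.summary.update(
    { "_id": key },                              // one summary document per key (illustrative)
    { "$inc": { "count": 1, "total": value } },  // maintain the running totals on every write
    { "upsert": true }
)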

As the saying goes, "Horses for courses", or use your tools for what they were designed for.

answered Oct 05 '22 by Neil Lunn