Our system is carrier grade and extremely robust. It has been load tested to handle 5000 transactions per second, and each transaction inserts a document into a single MongoDB collection (the application is write-only: no updates or queries). That amounts to ~700MM documents per day, which is our benchmark.
The MongoDB deployment is not yet sharded: we have a single replica set with one primary and two secondaries, all m2.2xlarge instances on EC2. Each instance is backed by a 1TB RAID0 stripe of 8 volumes (no PIOPS). We are using the node-mongodb-native driver with the C++ native BSON parser for optimal write performance, and we have attempted to model the document structure accordingly:
{
  _id: ObjectId(),
  a: 'string',
  b: 'string',
  c: 'string' or <int>,        // customer identifier
  g: 'string' or <not_exist>,
  t: ISODate(),                // transaction timestamp
  h: <int>,                    // hour bucket
  d: <int>,                    // day bucket
  m: <int>,                    // month bucket
  y: <int>                     // year bucket
}
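For completeness, here is a minimal sketch of the write path as described above (node-mongodb-native 1.x style; the connection string, database and collection names, and field values are all illustrative):

var MongoClient = require('mongodb').MongoClient;

// native_parser enables the C++ BSON parser mentioned above.
MongoClient.connect('mongodb://localhost:27017/appdb', { db: { native_parser: true } }, function (err, db) {
  if (err) throw err;
  var col = db.collection('transactions');
  var now = new Date();
  // One insert per transaction; y/m/d/h are denormalized from the
  // timestamp so ranged queries can hit the compound index directly.
  col.insert({
    a: 'some_string',
    b: 'some_string',
    c: 'customer_1',
    t: now,
    h: now.getUTCHours(),
    d: now.getUTCDate(),
    m: now.getUTCMonth() + 1,  // 1-12
    y: now.getUTCFullYear()
  }, { w: 1 }, function (err) {
    if (err) throw err;
  });
});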
// Compound index covering the customer and time-bucket fields used below:
col.ensureIndex({ c: 1, a: 1, y: 1, m: 1, d: 1, h: 1 });
The aggregation below reports, per customer, the number of distinct values of a and the total document count (t) for each month:

col.aggregate([
  // Restrict to one customer and one month via the indexed bucket fields.
  { $match: { c: 'customer_1', y: 2013, m: 11 } },
  // Per (c, y, m): collect the distinct 'a' values and count documents.
  { $group: { _id: { c: '$c', y: '$y', m: '$m' }, a: { $addToSet: '$a' }, t: { $sum: 1 } } },
  // Count the distinct 'a' values, carrying the total count through _id.
  { $unwind: '$a' },
  { $group: { _id: { c: '$_id.c', y: '$_id.y', m: '$_id.m', t: '$t' }, a: { $sum: 1 } } },
  { $sort: { '_id.m': 1 } },
  { $project: { _id: 0, c: '$_id.c', y: '$_id.y', m: '$_id.m', a: 1, t: '$_id.t' } },
  // Roll months up into a per-year array...
  { $group: { _id: { c: '$c', y: '$y' }, monthly: { $push: { m: '$m', a: '$a', t: '$t' } } } },
  { $sort: { '_id.y': 1 } },
  { $project: { _id: 0, c: '$_id.c', y: '$_id.y', monthly: 1 } },
  // ...and years up into a per-customer array.
  { $group: { _id: { c: '$c' }, yearly: { $push: { y: '$y', monthly: '$monthly' } } } },
  { $sort: { '_id.c': 1 } },
  { $project: { _id: 0, c: '$_id.c', yearly: 1 } }
]);
The result:

[
    {
        "yearly": [
            {
                "y": 2013,
                "monthly": [
                    {
                        "m": 11,
                        "a": 3465652,
                        "t": 9844935
                    }
                ]
            }
        ],
        "c": "customer_1"
    }
]

Total runtime: 63181ms. The explain output:
{
    "cursor" : "BtreeCursor c_1_a_1_y_1_m_1_d_1_h_1",
    "isMultiKey" : false,
    "n" : 9844935,
    "nscannedObjects" : 0,
    "nscanned" : 9844935,
    "nscannedObjectsAllPlans" : 101,
    "nscannedAllPlans" : 9845036,
    "scanAndOrder" : false,
    "indexOnly" : true,
    "nYields" : 27,
    "nChunkSkips" : 0,
    "millis" : 32039,
    "indexBounds" : {
        "c" : [ [ "customer_1", "customer_1" ] ],
        "a" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
        "y" : [ [ 2013, 2013 ] ],
        "m" : [ [ 11, 11 ] ],
        "d" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
        "h" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ]
    }
}
Given the high frequency of inserts and our need to perform ranged aggregation queries over time, is this time-bucketing good practice, considering that the application can insert 30MM documents in a single hour?
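By a "ranged aggregation query over time" we mean a shape like the following (the day range is illustrative):

col.aggregate([
  // One customer, first week of November, counted per day.
  { $match: { c: 'customer_1', y: 2013, m: 11, d: { $gte: 1, $lte: 7 } } },
  { $group: { _id: '$d', t: { $sum: 1 } } },
  { $sort: { _id: 1 } }
]);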
We were under the impression that MongoDB can query billions of documents in seconds.
Would (parallel) MapReduce be a better alternative?
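For reference, a minimal sketch of what a mapReduce equivalent of the monthly total would look like (document counts only; counting distinct values of a would take additional work, which is part of why we ask):

col.mapReduce(
  function () { emit({ y: this.y, m: this.m }, 1); },
  function (key, values) { return Array.sum(values); },
  { query: { c: 'customer_1', y: 2013, m: 11 }, out: { inline: 1 } }
);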
You wonder why your aggregation is taking so "long". Aside from the points made by @Avish in his answer (you are doing some unnecessary steps), you also have to consider your physical resources and where the time is going.
Here is part of your "explain":
"n" : 9844935, "nscannedObjects" : 0, "nscanned" : 9844935, "scanAndOrder" : false, "indexOnly" : true, "nYields" : 27, "millis" : 32039,
Things of note: the indexed scan took 32 seconds (not the full 63 seconds you measured for the aggregation), it never had to fetch a single document because it got all of the information from the index itself, and it didn't have to do any in-memory sorts. But it did have to yield 27 times. Why is that? Reader processes yield for two reasons: when a write is waiting (writers take precedence, and long-running reads have to yield to them), or on a page fault: every operation must yield when data it is trying to access is not in RAM (this prevents a process from blocking others from doing work while it waits for its data to be loaded into RAM).
Questions that come to mind are: was the DB cold? Do the indexes fit in RAM? Were there writes happening at the same time that the reads had to contend with?
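A few standard shell checks can answer those questions (col stands in for your actual collection):

db.col.totalIndexSize();      // total bytes of index data for the collection
db.serverStatus().mem;        // resident vs. mapped memory of this mongod
db.serverStatus().extra_info; // includes page_faults on Linux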
I would check that the index can fit in RAM, run the touch command to make sure it is loaded, simplify the aggregation pipeline so it does no unnecessary work, and then run it again a couple of times in a row and see how the timings look then. Both steps are sketched below.
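Pre-warming the indexes (touch is available in MongoDB 2.2+; again, col stands in for the actual collection name):

// Load the collection's indexes, but not its documents, into RAM.
db.runCommand({ touch: 'col', data: false, index: true });

And one possible simplification of the pipeline, assuming only a single (c, y, m) combination is matched so the repeated group/sort/project stages collapse (on MongoDB 2.6+ a $size projection could replace the unwind/regroup):

col.aggregate([
  { $match: { c: 'customer_1', y: 2013, m: 11 } },
  // One pass: total document count plus the set of distinct 'a' values.
  { $group: { _id: null, a: { $addToSet: '$a' }, t: { $sum: 1 } } },
  // Count the distinct 'a' values, carrying the total through _id.
  { $unwind: '$a' },
  { $group: { _id: '$t', a: { $sum: 1 } } },
  { $project: { _id: 0, t: '$_id', a: 1 } }
]);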