Does the aggregation framework introduced in MongoDB 2.2 have any special performance improvements over map/reduce?
If yes, why, how, and by how much?
(I have already run a test myself, and the performance was nearly the same.)
On large collections of millions of documents, MongoDB's aggregation has been shown to perform much worse than Elasticsearch. Performance worsens with collection size once MongoDB starts hitting the disk because of limited system RAM. The $lookup stage, when used without indexes, can be very slow.
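For example (a sketch only, with hypothetical orders and customers collections standing in for real data), indexing the foreign collection's join field is usually the first fix for a slow $lookup:

// Index the field $lookup matches on in the foreign collection;
// without it, every input document causes a scan of "customers".
db.customers.createIndex({ customerId: 1 })

db.orders.aggregate([
    { $lookup: {
        from: "customers",
        localField: "customerId",
        foreignField: "customerId",
        as: "customer"
    } }
])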
MongoDB's map-reduce is based on JavaScript (executed by the SpiderMonkey engine), and the queries run in a single thread. Aggregation Pipeline queries, on the other hand, run as compiled C++ code, which makes them faster because they are not interpreted like JavaScript.
Map-reduce operations can be rewritten using aggregation pipeline operators such as $group, $merge, and others. For map-reduce operations that require custom functionality, MongoDB provides the $accumulator and $function aggregation operators starting in version 4.4.
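As a rough sketch (against the testtable collection from the benchmark below; the testtable_totals output name is made up), a map-reduce that emits (key2, value) and sums per key can be written with $group, optionally persisting the result with $merge, while $accumulator covers cases that still need custom JavaScript:

// $group replaces the map/emit + reduce pair; $merge writes the result out.
db.testtable.aggregate([
    { $group: { _id: "$key2", total: { $sum: "$value" } } },
    { $merge: { into: "testtable_totals" } }   // drop this stage to get the results back inline
])

// For custom logic, $accumulator (4.4+) runs JavaScript inside $group:
db.testtable.aggregate([
    { $group: {
        _id: "$key2",
        total: { $accumulator: {
            init: function() { return 0; },
            accumulate: function(state, v) { return state + v; },
            accumulateArgs: ["$value"],
            merge: function(a, b) { return a + b; },
            lang: "js"
        } }
    } }
])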
The aggregation pipeline provides efficient data aggregation using native operations and is the preferred method for data aggregation in MongoDB. It can operate on a sharded collection and can use indexes to improve its performance during some of its stages.
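For instance (a sketch against the same testtable collection), putting $match first and indexing the filtered field lets the pipeline use the index, which explain() will confirm:

// With an index on key1, the leading $match stage can use it
// instead of scanning the whole collection.
db.testtable.createIndex({ key1: 1 })

db.testtable.explain("executionStats").aggregate([
    { $match: { key1: "663969462d2ec0a5fc34" } },
    { $group: { _id: "$key2", pop: { $sum: "$value" } } }
])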
Every test I have personally run (including using your own data) shows the aggregation framework being a multiple faster than map-reduce, and usually an order of magnitude faster.
Just taking 1/10th of the data you posted (but warming the cache first rather than clearing the OS cache, because I want to measure the performance of the aggregation, not how long it takes to page in the data) I got this:
MapReduce: 1,058ms
Aggregation Framework: 133ms
Removing the $match from the aggregation framework and the {query:} from mapReduce (because both would just use an index, and that's not what we want to measure) and grouping the entire dataset by key2, I got the following (the modified queries are sketched after the timings):
MapReduce: 18,803ms
Aggregation Framework: 1,535ms
Those are very much in line with my previous experiments.
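For reference, the modified queries are just the ones from the benchmark below with the filter removed:

// mapReduce over the whole collection, no {query:} filter:
db.testtable.mapReduce(
    function() { emit(this.key2, this.value); },
    function(key, values) { var i = 0; values.forEach(function(v) { i += v; }); return i; },
    { out: { inline: 1 } }
)

// Aggregation grouping the entire dataset by key2, no $match:
db.testtable.aggregate([
    { $group: { _id: '$key2', pop: { $sum: '$value' } } }
])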
My benchmark:
== Data Generation ==
Generate 4 million rows (with Python), each with approximately 350 bytes. Each document has these keys:

key1, key2: random hex strings drawn from small pools (about 20 and 2,000 distinct values overall)
baddata: a long padding string to make each document bigger
value: a constant 10, used for the sums

Total data size was about 6GB in mongo (and 2GB in postgres).

from binascii import hexlify
import os
import random

from pymongo import Connection

db = Connection('127.0.0.1').test  # mongo connection
random.seed(1)
for _ in range(2):
    key1s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(10)]
    key2s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(1000)]
    baddata = 'some long date ' + '*' * 300
    for i in range(2000):
        data_list = [{
            'key1': random.choice(key1s),
            'key2': random.choice(key2s),
            'baddata': baddata,
            'value': 10,
        } for _ in range(1000)]
        for data in data_list:
            db.testtable.save(data)
== Tests ==
I ran several tests, but one is enough for comparing results:
NOTE: The server is restarted and the OS cache is cleared after each query, to ignore the effect of caching.
QUERY: aggregate all rows with key1=somevalue (about 200K rows) and sum value for each key2.

The queries:
map/reduce:

db.testtable.mapReduce(
    function() { emit(this.key2, this.value); },             // map: emit (key2, value) pairs
    function(key, values) {                                   // reduce: sum the values for each key2
        var i = 0;
        values.forEach(function(v) { i += v; });
        return i;
    },
    { out: { inline: 1 }, query: { key1: '663969462d2ec0a5fc34' } }
)
aggregate:

db.testtable.aggregate([
    { $match: { key1: '663969462d2ec0a5fc34' } },
    { $group: { _id: '$key2', pop: { $sum: '$value' } } }
])
group:

db.testtable.group({
    key: { key2: 1 },
    cond: { key1: '663969462d2ec0a5fc34' },
    reduce: function(obj, prev) { prev.csum += obj.value; },
    initial: { csum: 0 }
})