MongoDB's performance on aggregation queries

Tags: performance, mongodb

After hearing so many good things about MongoDB's performance, we decided to give MongoDB a try to solve a problem we have. I started by moving all the records we have in several MySQL databases into a single collection in MongoDB. This resulted in a collection of 29 million documents (each with at least 20 fields), which takes around 100 GB of disk space. We decided to put them all in one collection since all the documents have the same structure and we want to query and aggregate results across all of them.

I created some indexes to match my queries; otherwise even a simple count() would take ages. However, queries such as distinct() and group() still take way too long.

Example:

    // creation of a compound index
    db.collection.ensureIndex({'metadata.system':1, 'metadata.company':1})

    // query to get all the combinations of companies and systems
    db.collection.group({key: { 'metadata.system':true, 'metadata.company':true }, reduce: function(obj,prev) {}, initial: {} });

I took a look at the mongod log and it has a lot of lines like these (while executing the query above):

    Thu Apr  8 14:40:05 getmore database.collection cid:973023491046432059 ntoreturn:0 query: {}  bytes:1048890 nreturned:417 154ms
    Thu Apr  8 14:40:08 getmore database.collection cid:973023491046432059 ntoreturn:0 query: {}  bytes:1050205 nreturned:414 430ms
    Thu Apr  8 14:40:18 getmore database.collection cid:973023491046432059 ntoreturn:0 query: {}  bytes:1049748 nreturned:201 130ms
    Thu Apr  8 14:40:27 getmore database.collection cid:973023491046432059 ntoreturn:0 query: {}  bytes:1051925 nreturned:221 118ms
    Thu Apr  8 14:40:30 getmore database.collection cid:973023491046432059 ntoreturn:0 query: {}  bytes:1053096 nreturned:250 164ms
    ...
    Thu Apr  8 15:04:18 query database.$cmd ntoreturn:1 command  reslen:4130 1475894ms

This query took 1475894ms (roughly 24.6 minutes), which is far longer than I would expect (the result list has around 60 entries). First of all, is this expected given the large number of documents in my collection? Are aggregation queries in general expected to be this slow in MongoDB? Any thoughts on how I can improve the performance?

I am running mongod on a single dual-core machine with 10 GB of memory.

Thank you.

asked Apr 08 '10 12:04

Mario Duarte

2 Answers

The idea is that you improve the performance of aggregation queries by using MapReduce on a sharded database that is distributed over multiple machines.

I compared the performance of MongoDB's MapReduce with a GROUP BY SELECT statement in Oracle on the same machine, using a collection/table with approximately 14 million documents/rows. I found that MongoDB was approximately 25 times slower. This means that, even assuming MapReduce scales linearly with the number of shards, I would have to shard the data over at least 25 machines to get the same performance from MongoDB that Oracle delivers on a single machine.

Exporting the data from MongoDB via mongoexport.exe, using the export as an external table in Oracle, and doing the GROUP BY in Oracle was much faster than using MongoDB's own MapReduce.
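For reference, here is a minimal sketch of what a MapReduce job for the question's grouping could look like, and why it lends itself to sharding. It is not the job I benchmarked. The sample documents, the two in-memory "shards", and the helper names `runOnShard` and `mergeShards` are all made up for illustration, and the code runs in plain JavaScript without a server. Only `map` and `reduce` are the kind of functions you would pass to `db.collection.mapReduce`.

```javascript
// Hypothetical sample documents shaped like the question's data.
const shardA = [
  { metadata: { system: "crm", company: "acme" } },
  { metadata: { system: "crm", company: "acme" } },
  { metadata: { system: "erp", company: "globex" } },
];
const shardB = [
  { metadata: { system: "crm", company: "acme" } },
  { metadata: { system: "erp", company: "initech" } },
];

// In mongod, emit() is provided by the server. Here the local driver sets it.
let emit;

// map: called once per document with `this` bound to the document.
function map() {
  emit({ system: this.metadata.system, company: this.metadata.company }, 1);
}

// reduce: sums the counts for one key. It must accept its own output as
// input, because partial results are reduced again when they are merged.
function reduce(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Local stand-in for one shard: collect emitted values per key, then reduce.
function runOnShard(docs) {
  const groups = new Map();
  emit = (key, value) => {
    const k = JSON.stringify(key);
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(value);
  };
  docs.forEach((doc) => map.call(doc));
  const partial = new Map();
  for (const [k, values] of groups) partial.set(k, reduce(JSON.parse(k), values));
  return partial;
}

// Merge step: re-reduce the partial results coming from every shard.
function mergeShards(partials) {
  const groups = new Map();
  for (const partial of partials) {
    for (const [k, v] of partial) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    }
  }
  const merged = {};
  for (const [k, values] of groups) merged[k] = reduce(JSON.parse(k), values);
  return merged;
}

const result = mergeShards([runOnShard(shardA), runOnShard(shardB)]);
console.log(result);
```

Each shard only touches its own documents, and only the small per-key partial results have to be combined at the end. That is the part that can run in parallel across machines.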

answered Oct 03 '22 20:10

TTT

A couple of things.

1) Your group query is processing lots of data. While your result set is small, it looks like it's doing a table scan of all the data in your collection in order to generate that small result. This is probably the root cause of the slowness. You might want to look at the disk performance of your server with iostat while the query is running, as that is likely the bottleneck.

2) As has been pointed out in other answers, the group command uses the JavaScript interpreter, which is going to limit performance. You might try the new aggregation framework that was released as beta in 2.1 (note: this is an unstable release as of Feb 24 2012). See http://blog.mongodb.org/post/16015854270/operations-in-the-new-aggregation-framework for a good introduction. This won't overcome the data volume problem in (1), but it is implemented in C++, so if JavaScript time is the bottleneck, it should be much faster.

3) Another approach would be to use incremental map-reduce to generate a second collection with your grouped results. The idea is that you'd run a map-reduce job to aggregate your results once, and then periodically run another map-reduce job that re-reduces new data into the existing collection. Then you can query this second collection from your app rather than running a group command every time.
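Suggestions 2 and 3 could be sketched roughly as follows. This is an illustration under assumptions, not tested against your data. The `pipeline` constant is what you would pass to `db.collection.aggregate` for the grouping in the question. The rest runs in plain JavaScript without a server and only imitates what the `out: { reduce: ... }` option of map-reduce does. The helpers `countBatch` and `mergeInto`, the sample documents, and the names `created`, `lastRun` and `system_company_counts` in the comment are all hypothetical.

```javascript
// Suggestion 2: a $group stage over the two fields from the question.
// In the mongo shell this would be run as db.collection.aggregate(pipeline).
const pipeline = [
  { $group: { _id: { system: "$metadata.system", company: "$metadata.company" } } },
];

// Suggestion 3: keep a summary collection and fold new data into it.
// In the mongo shell the periodic job might look like this, given a map
// function that emits the (system, company) pair with a count of 1:
//   db.collection.mapReduce(map, reduce, {
//     query: { created: { $gt: lastRun } },     // only documents since the last run
//     out: { reduce: "system_company_counts" }  // re-reduce into the existing output
//   });

// reduce sums counts. It has to accept its own earlier output as an input value.
function reduce(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Local stand-in for map + reduce over one batch of documents.
function countBatch(docs) {
  const counts = new Map();
  for (const doc of docs) {
    const key = `${doc.metadata.system}|${doc.metadata.company}`;
    counts.set(key, reduce(key, [counts.get(key) || 0, 1]));
  }
  return counts;
}

// Local stand-in for `out: { reduce: ... }`: if a key already exists in the
// summary, reduce the stored value with the new one. Otherwise insert it.
function mergeInto(summary, batchCounts) {
  for (const [key, value] of batchCounts) {
    summary.set(key, summary.has(key) ? reduce(key, [summary.get(key), value]) : value);
  }
  return summary;
}

const doc = (system, company) => ({ metadata: { system, company } });

// Initial full run, then one incremental run over newly arrived documents.
const summary = mergeInto(
  new Map(),
  countBatch([doc("crm", "acme"), doc("crm", "acme"), doc("erp", "globex")])
);
mergeInto(summary, countBatch([doc("crm", "acme"), doc("erp", "initech")]));
console.log(Object.fromEntries(summary));
```

The incremental run only reads the new documents, so its cost depends on how much data arrived since the last run rather than on the size of the whole collection. In later MongoDB releases the group command was removed and map-reduce was deprecated in favor of the aggregation pipeline, so check what your version supports before picking one of these.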

answered Oct 03 '22 20:10

jared
