We've recently passed 2 million records in one of our main collections, and we've started to suffer from major performance issues on that collection.
The documents in the collection have about 8 fields that can be filtered through the UI, and the results are supposed to be sorted by a timestamp field recording when each record was processed.
I've added several compound indexes combining the filtered fields and the timestamp, e.g.:
db.events.ensureIndex({somefield: 1, timestamp:-1})
I've also added a couple of indexes that combine several filters at once, hoping to achieve better performance, but some filter combinations still take an awfully long time.
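For example, one of those multi-filter indexes looks roughly like this (anotherfield stands in for one of the other filterable fields):
db.events.ensureIndex({somefield: 1, anotherfield: 1, timestamp: -1})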
Using explain(), I've made sure that the queries do use the indexes I've created, but performance is still not good enough.
I was wondering if sharding is the way to go now, but we will soon start adding about 1 million new records per day to that collection, so I'm not sure it will scale well.
EDIT: an example query:
> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['[email protected]']}}).sort({timestamp: -1}).limit(25).explain()
{
    "cursor" : "BtreeCursor user.userName_1_timestamp_-1",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 30060,
    "nscanned" : 30060,
    "nscannedObjectsAllPlans" : 120241,
    "nscannedAllPlans" : 120241,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 26495,
    "indexBounds" : {
        "user.userName" : [ [ "[email protected]", "[email protected]" ] ],
        "timestamp" : [ [ { "$maxElement" : 1 }, { "$minElement" : 1 } ] ]
    },
    "server" : "yarin:27017"
}
Please note that deviceType has only 2 possible values in my collection.
Combining MongoDB and Elasticsearch is a sound choice for processing millions of records in near real time, and the same structures and concepts carry over to much larger datasets as well.
By default, MongoDB logs all queries that take longer than 100 milliseconds.
MongoDB can easily handle billions of documents, including billions of documents in a single collection, but remember that the maximum document size is 16 MB. Plenty of people run collections of that size, and there are many discussions about it on the MongoDB Google User Group.
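A quick way to check and adjust that threshold from the shell (the 50 ms value below is just illustrative):
// show the current profiling level and the slowms threshold
db.getProfilingStatus()
// keep the profiler off (level 0) but log anything slower than 50 ms
db.setProfilingLevel(0, 50)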
This is searching for a needle in a haystack. We'd need some output of explain() for those queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:
1. Enable the database profiler with db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds for how long a query or command takes; anything slower will be logged.
2. Inspect the slow queries in db.system.profile and run them manually using explain() (see the sketch below).
3. Try to identify the slow operations in the explain() output, such as scanAndOrder or a large nscanned, etc.
A key problem is that you're apparently allowing your users to combine filters at will. Without index intersectioning, that will blow up the number of required indexes dramatically.
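For steps 2 and 3, a minimal sketch of that inspection loop (the database name mydb is a placeholder; the query values are taken from the example in the question):
// list the slowest operations recorded against the audit collection
db.system.profile.find({ns: 'mydb.audit'}).sort({millis: -1}).limit(10)
// re-run a suspicious query manually and inspect nscanned, scanAndOrder, indexOnly, etc.
db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': '[email protected]'}).sort({timestamp: -1}).limit(25).explain()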
Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.
Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so over 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan 167 documents (since 3 out of every 5 documents are active, finding 100 active ones means scanning about 100 × 5/3 ≈ 167), which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use over time is high, you might end up having to scan thousands of documents before finding a hundred matches.
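A minimal sketch of that contrast (the users collection and field names are illustrative, not taken from the question):
// an index on a field with only two values is rarely useful on its own
db.users.ensureIndex({status: 1})
// leading with a more selective criterion and keeping status as a trailing field
// lets the index narrow the range first, with the status filter applied within it
db.users.ensureIndex({joined_at: -1, status: 1})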
So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.
Things get worse when the data is too big for RAM, because then having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk at random, which takes a lot of time.
The best way to control this is to limit the number of different query types, disallow queries on low-selectivity information, and try to prevent random access to old data.
If all else fails and you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the MongoDB ids from there, and then get the results from MongoDB using $in. But that approach is fraught with its own perils.
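A minimal sketch of that pattern, assuming the external search service hands back a list of matching _id values (the two ObjectIds below are placeholders, and no particular search engine API is implied):
// ids of the matching documents, as returned by the external search index
var ids = [ObjectId('5a0000000000000000000001'), ObjectId('5a0000000000000000000002')]
// fetch the full documents from MongoDB, preserving the timestamp sort
db.audit.find({_id: {$in: ids}}).sort({timestamp: -1}).limit(25)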
-- EDIT --
The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "[email protected]". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'MOBILE'.
I assume this is some kind of web tracking, and the user's usage pattern is what makes the query slow (if they switched between mobile and web on a daily basis, the query would be fast).
Making this particular query faster could be done with a compound index that contains the device type, e.g. using
a) db.audit.ensureIndex({'user.userName': 1, 'userAgent.deviceType': 1, 'timestamp': -1})
or
b) db.audit.ensureIndex({'userAgent.deviceType': 1, 'user.userName': 1, 'timestamp': -1})
Unfortunately, that means queries like
db.audit.find({'user.userName': 'foo'}).sort({timestamp: -1})
can't use the same index anymore, so, as described, the number of indexes will grow very quickly.
I'm afraid there's no really good solution for this in MongoDB at this time.