I am trying to get all the distinct values (fewer than 10 possible values) of a given field in a large collection (3,500,000 documents, 35 GB).
I tried to get these values with db.collection.distinct('field'), but it is very slow, even though there is an index (it doesn't seem to be used).
Any suggestions to improve performance on this query?
Thanks
Edit
I was using MongoDB 2.4.9. The issue was fixed in 2.5.5 (https://jira.mongodb.org/browse/SERVER-2094), but I still have a performance problem with queries like db.logs.distinct("version", {wsId: "XXX"}), even though indexes exist for both fields.
MongoDB – distinct() Method
In MongoDB, the distinct() method finds the distinct values for a given field across a single collection and returns the results in an array. It takes three parameters: the first is the field for which to return distinct values, and the other two are optional.
Other ways to improve MongoDB performance after identifying your major query patterns include: storing the results of frequent sub-queries on documents to reduce read load; making sure that you have indexes on any fields you regularly query against; and looking at your logs to identify slow queries, then checking your indexes.
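For the filtered query from the question, db.logs.distinct("version", {wsId: "XXX"}), one way to apply the indexing advice is a single compound index that puts the filter field first and the distinct field second, rather than two separate single-field indexes. The sketch below is an assumption about what may help, not a confirmed fix for the original poster's setup. The helper names (compound_index_spec, distinct_for) are hypothetical, and collection is assumed to be a pymongo Collection.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING, kept local so the sketch has no imports


def compound_index_spec(filter_field, distinct_field):
    """Index keys: the equality-filter field first, then the field passed to distinct()."""
    return [(filter_field, ASCENDING), (distinct_field, ASCENDING)]


def distinct_for(collection, distinct_field, filter_field, filter_value):
    """Make sure the compound index exists, then run the filtered distinct.

    create_index does nothing if an identical index already exists, but the
    first build on a large collection can take a long time.
    """
    collection.create_index(compound_index_spec(filter_field, distinct_field))
    return collection.distinct(distinct_field, {filter_field: filter_value})
```

With the names from the question this would be called as distinct_for(db.logs, "version", "wsId", "XXX"). Whether the server actually answers the query from the index still depends on the MongoDB version, so it is worth checking the query plan afterwards.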
The question is a bit old, but I recently faced a similar issue and ended up caching the distinct values (in my case they rarely change).
This might be helpful for someone.
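import datetime
import pickle

from redis import Redis

# `settings` is the application's own configuration object (for example Django
# settings); it must provide REDIS_SERVER and REDIS_PORT.


def get_distinct_values(collection, column_name):
    client = Redis(settings.REDIS_SERVER, settings.REDIS_PORT)
    key = "distinct_{}".format(column_name)
    cached = client.get(key)
    if cached is None:
        # Cache miss: run the slow distinct once and keep the result for a day.
        res = list(collection.distinct(column_name))
        client.set(key, pickle.dumps(res), ex=datetime.timedelta(days=1))
    else:
        res = pickle.loads(cached)
    return res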
"distinct" makes use of the index if it's available. Run it like this to see whether an index is being used:
db.runCommand({distinct: "collectionNameGoesHere", key:"fieldNameGoesHere"})
The last field in the returned result is stats, which looks like this:
"stats" : {
    "n" : 280,
    "nscanned" : 280,
    "nscannedObjects" : 0,
    "timems" : 0,
    "cursor" : "BtreeCursor class_id_1"
}
Notice that my query used an index on the class_id field, since I had created it beforehand.
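The same check can be scripted from Python. This is a minimal sketch, assuming db is a pymongo Database and that the server returns a stats document in the 2.x format shown above (newer servers may not include it, in which case the stats come back empty). The helper names distinct_with_stats and used_index are hypothetical.

```python
def distinct_with_stats(db, collection_name, field, query=None):
    """Run the distinct command and return (values, stats)."""
    kwargs = {"key": field}
    if query:
        kwargs["query"] = query
    # Equivalent to db.runCommand({distinct: ..., key: ..., query: ...}) in the shell.
    result = db.command("distinct", collection_name, **kwargs)
    return result["values"], result.get("stats", {})


def used_index(stats):
    """True when the 2.x-style stats report a B-tree (index) cursor."""
    return stats.get("cursor", "").startswith("BtreeCursor")
```

For the stats shown above, used_index returns True because the cursor is "BtreeCursor class_id_1"; a cursor of "BasicCursor" would indicate a full collection scan.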