
Is there a good way to get distinct values in large collections in Mongo?

I am trying to get all the distinct values (fewer than 10 possible values) of a given field in a large collection (3,500,000 docs, 35 GB).

I tried to get these values with db.collection.distinct('field'), but it is very slow, even though there is an index (it doesn't seem to be used).

Any suggestions to improve performance on this query?

Thanks

Edit: I was using Mongo 2.4.9. This has been fixed in 2.5.5 (https://jira.mongodb.org/browse/SERVER-2094), but I still have a performance issue with queries like db.logs.distinct("version", {wsId: "XXX"}), even though indexes exist on both fields.

Clément Poissonnier asked May 12 '14 12:05




2 Answers

The question is a bit old, but I faced a similar issue recently and ended up caching the distinct values (since in my case they rarely change).

Might be helpful for someone.

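import datetime
import pickle

from redis import Redis

# `settings` is assumed to be the application's own config object exposing
# REDIS_SERVER and REDIS_PORT; import it from wherever your project defines it.

def get_distinct_values(collection, column_name):
    """Return the distinct values of column_name, cached in Redis for one day."""
    client = Redis(settings.REDIS_SERVER, settings.REDIS_PORT)
    key = "distinct_{}".format(column_name)
    cached = client.get(key)
    if cached is not None:
        # Cache hit: skip the slow distinct() call entirely.
        return pickle.loads(cached)
    res = list(collection.distinct(column_name))
    # Store the result and let Redis expire it after one day.
    client.set(key, pickle.dumps(res), ex=datetime.timedelta(days=1))
    return res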
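
If running a Redis server is not an option, the same caching idea can be sketched in-process with only the standard library. This is a different technique from the Redis version above (the cache lives in one process and is lost on restart), and the names `DistinctCache`, `ttl_seconds` and `clock` are hypothetical, chosen only for illustration. It assumes the collection object exposes a `distinct(column_name)` method, as in the snippet above.

```python
import time


class DistinctCache:
    """In-process cache for distinct() results of one collection, with a TTL."""

    def __init__(self, collection, ttl_seconds=86400, clock=time.monotonic):
        self.collection = collection
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so expiry can be exercised in tests
        self._entries = {}      # column name -> (expiry time, cached values)

    def get(self, column_name):
        now = self.clock()
        entry = self._entries.get(column_name)
        if entry is not None and entry[0] > now:
            # Entry is still fresh: skip the slow distinct() call.
            return entry[1]
        # Missing or expired: run the query once and remember the result.
        values = list(self.collection.distinct(column_name))
        self._entries[column_name] = (now + self.ttl_seconds, values)
        return values
```

With something like `cache = DistinctCache(db.logs)`, repeated calls to `cache.get("version")` within the TTL would hit the database only once. As with the Redis version, this only makes sense when the distinct values rarely change.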
Egor Wexler answered Sep 25 '22 01:09


"distinct" makes use of the index if it's available. Run it like this and see whether an index is being used:

db.runCommand({distinct: "collectionNameGoesHere", key:"fieldNameGoesHere"})

The last value in the returned result set is stats, which looks like this:

   "stats" : {
           "n" : 280,
           "nscanned" : 280,
           "nscannedObjects" : 0,
           "timems" : 0,
           "cursor" : "BtreeCursor class_id_1"
   }

Notice that my query used an index on the class_id field (the cursor is "BtreeCursor class_id_1" and nscannedObjects is 0), since I had created that index beforehand.
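
As a rough sketch, the check described above can be automated by inspecting the stats document. The helper name `describe_distinct_stats` is hypothetical, and it assumes the older stats layout shown above (a `cursor` string and an `nscannedObjects` count); newer server versions may report this information differently.

```python
def describe_distinct_stats(stats):
    """Classify how a distinct command ran, based on its legacy stats document."""
    cursor = stats.get("cursor", "")
    if not cursor.startswith("BtreeCursor"):
        # No B-tree cursor named: assume the collection itself was scanned.
        return "collection scan"
    if stats.get("nscannedObjects", 0) == 0:
        # Index entries were enough; no documents had to be loaded.
        return "index only"
    return "index plus document fetch"


# The stats document from the answer above.
stats = {
    "n": 280,
    "nscanned": 280,
    "nscannedObjects": 0,
    "timems": 0,
    "cursor": "BtreeCursor class_id_1",
}
print(describe_distinct_stats(stats))
```

For the stats shown above this reports an index-only run, which matches the observation that nscannedObjects is 0.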

alernerdev answered Sep 27 '22 01:09