Is there a good way to get distinct values in large collections in Mongo?

Tags:

mongodb

distinct

I am trying to get all the distinct values (fewer than 10 possible values) of a given field in a large collection (3,500,000 docs, 35 GB).

I tried db.collection.distinct('field'), but it is very slow even though the field is indexed (the index doesn't seem to be used).

Any suggestions to improve performance on this query?

Thanks

Edit: I was using Mongo 2.4.9. The issue has been fixed in 2.5.5 (https://jira.mongodb.org/browse/SERVER-2094), but I still have a performance problem with queries like db.logs.distinct("version", {wsId: "XXX"}), even though indexes exist for both fields.

asked May 12 '14 12:05

Clément Poissonnier

2 Answers

The question is a bit old, but I recently faced a similar issue and ended up caching the distinct values (since in my case they rarely change).

Might be helpful for someone.

import datetime
import pickle

from redis import Redis

def get_distinct_values(collection, column_name):
    # `settings` is the application's own config object (not defined in this snippet).
    client = Redis(settings.REDIS_SERVER, settings.REDIS_PORT)
    key = "distinct_{}".format(column_name)
    cached = client.get(key)
    if cached is None:
        # Cache miss: run the slow distinct once and keep the result for a day.
        res = list(collection.distinct(column_name))
        client.set(key, pickle.dumps(res), ex=datetime.timedelta(days=1))
    else:
        res = pickle.loads(cached)
    return res
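
To try the same cache-aside idea without a Redis server, here is a minimal in-memory sketch. `TTLCache`, `cached_distinct` and `FakeCollection` are hypothetical names introduced for illustration only; the stub collection just counts how often `distinct()` is actually executed, so you can see that the second lookup is served from the cache.

```python
import pickle
import time


class TTLCache:
    """Tiny in-memory stand-in for Redis: stores bytes with an expiry time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (expires_at, payload)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, payload = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._data[key]
            return None
        return payload

    def set(self, key, payload, ttl_seconds):
        self._data[key] = (self._clock() + ttl_seconds, payload)


def cached_distinct(cache, collection, column_name, ttl_seconds=86400):
    """Return the distinct values, hitting the collection only on a cache miss."""
    key = "distinct_{}".format(column_name)
    payload = cache.get(key)
    if payload is not None:
        return pickle.loads(payload)
    values = list(collection.distinct(column_name))
    cache.set(key, pickle.dumps(values), ttl_seconds)
    return values


class FakeCollection:
    """Hypothetical stub that counts how often distinct() is called."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = 0

    def distinct(self, field):
        self.calls += 1
        return sorted({row[field] for row in self.rows})
```

Any object with a `distinct(field)` method can be passed as `collection`, so a real pymongo collection should fit in the same way as the stub. The one-day default TTL mirrors the answer above; pick whatever matches how often your values change.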

answered Sep 25 '22 01:09

Egor Wexler

"distinct" makes use of the index if it's available. Run it like this and see whether the index is being used:

db.runCommand({distinct: "collectionNameGoesHere", key:"fieldNameGoesHere"})

The last value in the returned result is a stats document that looks like this:

   "stats" : {
           "n" : 280,
           "nscanned" : 280,
           "nscannedObjects" : 0,
           "timems" : 0,
           "cursor" : "BtreeCursor class_id_1"
   }

Notice that my query used an index on the class_id field, which I had created beforehand.
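
The stats document above comes from an older server; more recent MongoDB versions may not include it in the distinct reply, and there the usual route is the `explain` command. Below is a rough Python sketch of that check, assuming a pymongo `Database` object, an unsharded deployment, and the common explain layout where an index-backed distinct shows a `DISTINCT_SCAN` stage in the winning plan. The helper names (`explain_distinct`, `plan_stages`, `uses_distinct_scan`) are made up for this example.

```python
def explain_distinct(db, collection_name, key, query=None):
    """Ask the server how it would run a distinct (needs a live pymongo Database)."""
    command = {"distinct": collection_name, "key": key}
    if query:
        command["query"] = query
    # Sends {explain: {...}, verbosity: "queryPlanner"} to the server.
    return db.command("explain", command, verbosity="queryPlanner")


def plan_stages(plan):
    """Collect stage names from a winningPlan document, outermost stage first."""
    stages = []
    if "stage" in plan:
        stages.append(plan["stage"])
    children = []
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    children.extend(plan.get("inputStages", []))
    for child in children:
        stages.extend(plan_stages(child))
    return stages


def uses_distinct_scan(explain_output):
    """True if the winning plan contains a DISTINCT_SCAN stage."""
    plan = explain_output["queryPlanner"]["winningPlan"]
    # Assumption: some server versions nest the plan under "queryPlan".
    plan = plan.get("queryPlan", plan)
    return "DISTINCT_SCAN" in plan_stages(plan)
```

With a live connection, `uses_distinct_scan(explain_distinct(db, "logs", "version", {"wsId": "XXX"}))` would report on the filtered query from the question. If it returns False, a compound index such as {wsId: 1, version: 1} is one thing to try, though that is a suggestion beyond what this thread states.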

answered Sep 27 '22 01:09

alernerdev
