I am looking to get a random record from a huge collection (100 million records).
What is the fastest and most efficient way to do so?
The data is already there, and there is no field from which I can generate a random number and obtain a random row.
Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:
// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])
If you want to select the random document(s) from a filtered subset of the collection, prepend a $match
stage to the pipeline:
// Get one random document matching {a: 10} from the mycoll collection.
db.mycoll.aggregate([
    { $match: { a: 10 } },
    { $sample: { size: 1 } }
])
As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
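If duplicates are a concern, a minimal workaround sketch (not part of the original answer, and it assumes MongoDB 3.4+ for $replaceRoot) is to oversample and then collapse duplicates on _id; note that this may return fewer documents than requested:
// Oversample, de-duplicate by _id, then trim to the size you actually want.
db.mycoll.aggregate([
    { $sample: { size: 10 } },
    { $group: { _id: "$_id", doc: { $first: "$$ROOT" } } },
    { $replaceRoot: { newRoot: "$doc" } },
    { $limit: 5 }
])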
Do a count of all records, generate a random number between 0 and the count, and then do:
db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
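Spelled out end to end in the shell, that approach might look like the following sketch (the collection name is a placeholder):
// Count the documents, pick a random offset, then skip to it.
var count = db.yourCollection.count();
var yourRandomNumber = Math.floor(Math.random() * count);
var randomDoc = db.yourCollection.find().limit(-1).skip(yourRandomNumber).next();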
3.2 introduced $sample to the aggregation pipeline.
There's also a good blog post on putting it into practice.
This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."
The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/
To paraphrase the recipe, you assign random numbers to your documents:
db.docs.save( { key : 1, ..., random : Math.random() } )
Then select a random document:
rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
    result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}
Querying with both $gte and $lte is necessary to find the document with a random number nearest to rand.
And of course you'll want to index on the random field:
db.docs.ensureIndex( { key : 1, random :1 } )
If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.
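As an illustration only (the category field here is hypothetical), extending an existing index would look something like this:
// The collection was previously indexed on { category : 1 }.
db.docs.dropIndex( { category : 1 } )
db.docs.ensureIndex( { category : 1, random : 1 } )
// A random pick within a category now stays fully indexed
// (the $lte fallback described above still applies if this returns null):
db.docs.findOne( { category : "books", random : { $gte : Math.random() } } )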
You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.
First, enable geospatial indexing on a collection:
db.docs.ensureIndex( { random_point: '2d' } )
To create a bunch of documents with random points on the X-axis:
for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}
Then you can get a random document from the collection like this:
db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )
Or you can retrieve several documents nearest to a random point:
db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
The following recipe is a little slower than the mongo cookbook solution (add a random key to every document), but returns more evenly distributed random documents. It's a little less evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.
function draw(collection, query) {
    // query: mongodb query object (optional)
    var query = query || { };
    query['random'] = { $lte: Math.random() };
    var cur = collection.find(query).sort({ random: -1 });
    if (! cur.hasNext()) {
        delete query.random;
        cur = collection.find(query).sort({ random: -1 });
    }
    var doc = cur.next();
    // re-roll the picked document's random value so it can be drawn again fairly
    doc.random = Math.random();
    collection.update({ _id: doc._id }, doc);
    return doc;
}
It also requires you to add a "random" field to your documents, so don't forget to add it when you create them; you may need to initialize your existing collection as shown by Geoffrey:
function addRandom(collection) {
    collection.find().forEach(function (obj) {
        obj.random = Math.random();
        collection.save(obj);
    });
}
db.eval(addRandom, db.things);
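For reference, a usage sketch (the collection and query here are placeholders, not part of the original answer):
var doc = draw(db.docs);                   // any random document
var filtered = draw(db.docs, { key: 2 });  // a random document among those matching the query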
Benchmark results
This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:
For a collection with 1,000,000 elements:
This method takes less than a millisecond on my machine.
The skip() method takes 180 ms on average.
The cookbook method will cause large numbers of documents to never get picked, because their random number does not favor them.
This method will pick all elements evenly over time.
In my benchmark it was only 30% slower than the cookbook method.
The randomness is not 100% perfect, but it is very good (and it can be improved if necessary).
This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 })limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value from diff and divide/multiply be 1000 for The "_id" precision:
var random = Math.floor(Math.floor(Math.random(diff)*diff)/1000)*1000;
// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId
value since that is what we are looking for. Using integers as the _id
value is essentially simplier but the same basic idea in the points.
Now you can use the aggregation pipeline. Example:
db.users.aggregate(
[ { $sample: { size: 3 } } ]
)
See the doc.
In Python using pymongo:
import random

def get_random_doc():
    count = collection.count()
    return collection.find()[random.randrange(count)]
Using Python (pymongo), the aggregate function also works.
collection.aggregate([{'$sample': {'size': sample_size }}])
This approach is a lot faster than running a query for a random offset (e.g. collection.find()[random_int]), especially for large collections.
It is tough if there is no data there to key off of. What is the _id field? Is it a MongoDB ObjectId? If so, you could get the highest and lowest values:
lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;
Then, if you assume the IDs are uniformly distributed (they aren't, but at least it's a start):
unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)
V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();
randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
You can pick a random timestamp and search for the first object that was created afterwards. It will only scan a single document, though it doesn't necessarily give you a uniform distribution.
var randRec = function() {
    // replace with your collection
    var coll = db.collection
    // get unixtime of first and last record
    var min = coll.find().sort({_id: 1}).limit(1)[0]._id.getTimestamp() - 0;
    var max = coll.find().sort({_id: -1}).limit(1)[0]._id.getTimestamp() - 0;
    // allow to pass additional query params
    return function(query) {
        if (typeof query === 'undefined') query = {}
        var randTime = Math.round(Math.random() * (max - min)) + min;
        var hexSeconds = Math.floor(randTime / 1000).toString(16);
        var id = ObjectId(hexSeconds + "0000000000000000");
        query._id = {$gte: id}
        return coll.find(query).limit(1)
    };
}();
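A quick usage sketch, assuming the snippet above has been pasted into the shell (the type field is purely illustrative):
randRec().toArray();            // one random document (returned as a cursor, hence toArray)
randRec({ type: "article" });   // a random document matching additional criteria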
My solution in PHP:
/**
 * Get random docs from Mongo
 * @param $collection
 * @param $where
 * @param $fields
 * @param $limit
 * @author happy-code
 * @url happy-code.com
 */
private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {
    // Total docs
    $count = $collection->find($where, $fields)->count();
    if (!$limit) {
        // Get all docs
        $limit = $count;
    }
    $data = array();
    for ($i = 0; $i < $limit; $i++) {
        // Skip a random number of documents
        $skip = rand(0, ($count - 1));
        if ($skip !== 0) {
            $doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
        } else {
            $doc = $collection->find($where, $fields)->limit(1)->getNext();
        }
        if (is_array($doc)) {
            // Collect the document
            $data[ $doc['_id']->{'$id'} ] = $doc;
            // Ignore the current document on the next iteration
            $where['_id']['$nin'][] = $doc['_id'];
        }
        // Each iteration removes one document from the pool, so decrease the total count
        $count--;
    }
    return $data;
}
In order to get a given number of random docs without duplicates:
Loop getting a random index and skip duplicates.
number_of_docs = 7
db.collection('preguntas').find({}, { _id: 1 }).toArray(function(err, arr) {
    count = arr.length
    idsram = []
    rans = []
    while (number_of_docs != 0) {
        var R = Math.floor(Math.random() * count);
        if (rans.indexOf(R) > -1) {
            continue
        } else {
            rans.push(R)
            idsram.push(arr[R]._id)
            number_of_docs--
        }
    }
    // fetch only the randomly selected documents
    db.collection('preguntas').find({ _id: { $in: idsram } }).toArray(function(err1, doc1) {
        if (err1) { console.log(err1); return; }
        res.send(doc1)
    });
});
I would suggest using map/reduce, where you use the map function to emit only when a random value falls below a given probability.
function mapf() {
    if (Math.random() <= probability) {
        emit(1, this);
    }
}
function reducef(key, values) {
    return { "documents": values };
}
res = db.questions.mapReduce(mapf, reducef, { "out": { "inline": 1 }, "scope": { "probability": 0.5 } });
printjson(res.results);
The reducef function above works because only one key ('1') is emitted from the map function.
The value of the "probability" is defined in the "scope", when invoking mapRreduce(...)
Using mapReduce like this should also be usable on a sharded db.
If you want to select exactly n of m documents from the db, you could do it like this:
function mapf() {
    if (countSubset == 0) return;
    var prob = countSubset / countTotal;
    if (Math.random() <= prob) {
        emit(1, { "documents": [this] });
        countSubset--;
    }
    countTotal--;
}
function reducef(key, values) {
    var newArray = new Array();
    for (var i = 0; i < values.length; i++) {
        newArray = newArray.concat(values[i].documents);
    }
    return { "documents": newArray };
}
res = db.questions.mapReduce(mapf, reducef, { "out": { "inline": 1 }, "scope": { "countTotal": 4, "countSubset": 2 } })
printjson(res.results);
Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.
This approach might give some problems on sharded databases.
You can pick a random _id and return the corresponding object:
db.collection.count(function(err, count) {
    db.collection.distinct("_id", function(err, result) {
        if (err) return res.send(err);
        var randomId = result[Math.floor(Math.random() * (count - 1))];
        db.collection.findOne({ _id: randomId }, function(err, result) {
            if (err) return res.send(err);
            console.log(result);
        });
    });
});
Here you don't need to spend space storing random numbers in the collection.
I'd suggest adding a random int field to each object. Then you can just do a findOne({random_field: {$gte: rand()}}) to pick a random document. Just make sure you ensureIndex({random_field: 1}).
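A minimal sketch of that suggestion in the shell; the random_field name and value range are illustrative, and a fallback query is added for the case where the random value exceeds every stored value:
db.docs.insert( { title: "example", random_field: Math.floor(Math.random() * 1000000) } )
db.docs.ensureIndex( { random_field: 1 } )
var r = Math.floor(Math.random() * 1000000);
var doc = db.docs.findOne( { random_field: { $gte: r } } ) ||
          db.docs.findOne( { random_field: { $lte: r } } );  // fallback, as in the cookbook recipe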
The following aggregation operation randomly selects 3 documents from the collection:
db.users.aggregate( [ { $sample: { size: 3 } } ] )
https://docs.mongodb.com/manual/reference/operator/aggregation/sample/
MongoDB now has $rand. To pick n non-repeating items, aggregate with { $addFields: { _f: { $rand: {} } } }, then $sort by _f and $limit n.
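A sketch of that pipeline in the shell, assuming n = 5 and a MongoDB version that supports $rand:
db.mycoll.aggregate([
    { $addFields: { _f: { $rand: {} } } },  // attach a random float to every document
    { $sort: { _f: 1 } },                   // order by the random value
    { $limit: 5 },                          // keep n documents; no document can repeat
    { $project: { _f: 0 } }                 // optionally drop the helper field
])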
When I was faced with a similar problem, I backtracked and found that the business request was actually for creating some form of rotation of the inventory being presented. In that case, there are much better options, which have answers from search engines like Solr, not data stores like MongoDB.
In short, with the requirement to "intelligently rotate" content, what we should do instead of a random number across all of the documents is to include a personal q score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last-seen date, and whatever other factors the business finds meaningful for computing a q score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, apply the q score modifier, take the number of records requested by the end user, and then randomize that page of results; since it's a tiny set, simply sort the documents in the application layer (in memory).
If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than user.
If the universe of products is small enough, you can create an index per user.
I have found this technique to be much more efficient, but more importantly more effective in creating a relevant, worthwhile experience of using the software solution.
None of the solutions worked well for me, especially when there are many gaps and the set is small. This worked very well for me (in PHP):
$count = $collection->count($search);
$skip = mt_rand(0, $count - 1);
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();
My PHP/MongoDB sort/order by RANDOM solution. Hope this helps someone.
Note: I have numeric IDs within my MongoDB collection that refer to MySQL database records.
First I create an array with 10 randomly generated numbers:
$randomNumbers = [];
for ($i = 0; $i < 10; $i++) {
    $randomNumbers[] = rand(0, 1000);
}
In my aggregation I use the $addFields pipeline stage combined with $arrayElemAt and $mod (modulus). The modulus operator will give me a number from 0 to 9, which I then use to pick a number from the array of randomly generated numbers.
$aggregate[] = [
    '$addFields' => [
        'random_sort' => [ '$arrayElemAt' => [ $randomNumbers, [ '$mod' => [ '$my_numeric_mysql_id', 10 ] ] ] ],
    ],
];
After that you can use the $sort pipeline stage:
$aggregate[] = [
    '$sort' => [
        'random_sort' => 1
    ]
];
The best way in Mongoose is to make an aggregation call with $sample. However, Mongoose does not return Mongoose documents from an aggregation, especially not if populate() is to be applied as well.
For getting a "lean" array from the database:
/*
    The Sample model should be initialized first:
    const Sample = mongoose …
*/
const samples = await Sample.aggregate([
    { $match: {} },
    { $sample: { size: 33 } },
]).exec();
console.log(samples); // a lean Array
For getting an array of mongoose documents:
const samples = (
    await Sample.aggregate([
        { $match: {} },
        { $sample: { size: 27 } },
        { $project: { _id: 1 } },
    ]).exec()
).map(v => v._id);
const mongooseSamples = await Sample.find({ _id: { $in: samples } });
console.log(mongooseSamples); // an Array of mongoose documents