
How to get random single document from 1 billion documents in mongoDB using python? [duplicate]

I want a single random document from a MongoDB collection. My collection currently contains more than 1 billion documents. How can I get a single random document from that collection?

Hitul Mistry asked Nov 23 '12 07:11



2 Answers

I have never worked with MongoDB from Python, but there is a general solution to your problem. Here is a MongoDB shell script for obtaining a single random document:

N = db.collection.count(condition)
db.collection.find(condition).limit(1).skip(Math.floor(Math.random()*N))

Here, condition is a MongoDB query. If you want to sample from the entire collection, use condition = null.

It's a general solution, so it works with any MongoDB driver.
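
Since the question asks about Python, the shell snippet above might translate to PyMongo roughly as follows. This is a minimal sketch, assuming `collection` is a pymongo collection object; the helper name `random_document_by_skip` is made up for illustration and is not part of any library.

```python
import random


def random_document_by_skip(collection, condition=None):
    """Return one random document matching `condition`, or None if nothing matches.

    `collection` is assumed to be a pymongo Collection; this helper is
    hypothetical, not a library function.
    """
    query = condition or {}
    # Count the matching documents (N in the shell version).
    n = collection.count_documents(query)
    if n == 0:
        return None
    # Skip a uniformly chosen number of documents in [0, n - 1], then take one.
    offset = random.randrange(n)
    cursor = collection.find(query).skip(offset).limit(1)
    return next(iter(cursor), None)
```

The server still has to walk past `offset` documents before returning one, so the cost grows with the size of the collection.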


Update

I ran a benchmark to test several implementations. First, I created a test collection of 5567249 documents with an indexed random field, rnd.

I chose three methods to compare with each other:

First method:

db.collection.find().limit(1).skip(Math.floor(Math.random()*N))

Second method:

db.collection.find({rnd: {$gte: Math.random()}}).sort({rnd:1}).limit(1)

Third method:

db.collection.findOne({rnd: {$gte: Math.random()}})

I ran each method 10 times and averaged the running time:

method 1: 882.1 msec
method 2: 1.2 msec
method 3: 0.6 msec

This benchmark shows that my solution is not the fastest one.

But the third method is not a good one either: it returns the first document in the database (in natural order) whose rnd is greater than or equal to the random threshold, so its output is not truly random.
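
The bias can be reproduced without a database. The following is a small in-memory simulation, not a MongoDB query: a Python list stands in for the collection, and it assumes, as stated above, that the scan runs in natural (insertion) order. The helper name `first_match_in_natural_order` is hypothetical.

```python
import random
from collections import Counter


def first_match_in_natural_order(docs, r):
    """Mimic findOne({rnd: {$gte: r}}) when documents are scanned in insertion order."""
    for doc in docs:
        if doc["rnd"] >= r:
            return doc
    return None


rng = random.Random(42)
docs = [{"_id": i, "rnd": rng.random()} for i in range(1000)]

hits = Counter()
for _ in range(10_000):
    doc = first_match_in_natural_order(docs, rng.random())
    if doc is not None:
        hits[doc["_id"]] += 1

# Only a document whose rnd exceeds every earlier rnd can ever be returned,
# so a small set of early documents absorbs all of the draws.
print(len(hits), "distinct documents returned out of", len(docs))
```

Most documents can never be selected at all, because some earlier document already satisfies every threshold they would satisfy.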

I think the second method is the best one for frequent use. It has one drawback: it requires adding the rnd field to every document in the collection and maintaining an additional index.
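
In Python, the second method could look roughly like this. It is a sketch, assuming `collection` is a pymongo collection whose documents carry an indexed `rnd` field with values in [0, 1). The helper name and the wrap-around fallback are additions for illustration and were not part of the benchmarked query.

```python
import random


def random_document_by_rnd(collection, field="rnd"):
    """Return the document with the smallest `field` value >= a random threshold.

    Hypothetical helper; `collection` is assumed to be a pymongo Collection.
    """
    r = random.random()
    # Sorting ascending on the indexed field picks the nearest value above r.
    doc = collection.find_one({field: {"$gte": r}}, sort=[(field, 1)])
    if doc is None:
        # r was larger than every stored value: wrap around to the smallest one.
        doc = collection.find_one({}, sort=[(field, 1)])
    return doc
```

Each document is chosen with probability proportional to the gap between its value and the next smaller one, so the result is only approximately uniform.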

Leonid Beschastny answered Oct 15 '22 10:10


Add an additional field named random to every document in your collection and make sure its value is a float between 0 and 1. In Python, random.random() produces such a value; for example, [random.random() for _ in range(10)] generates ten of them.
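
One way to backfill such a field from Python is sketched below. It assumes `collection` is a pymongo collection and writes a fresh `random.random()` value into every document; the helper name `assign_random_field` is hypothetical. With a billion documents, batched writes would likely be preferable to one update per document.

```python
import random


def assign_random_field(collection, field="random"):
    """Give every document its own random float in [0, 1); return the number updated.

    Hypothetical helper; `collection` is assumed to be a pymongo Collection.
    """
    updated = 0
    # Fetch only the _id of each document to keep the cursor light.
    for doc in collection.find({}, {"_id": 1}):
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {field: random.random()}},
        )
        updated += 1
    return updated
```

An index on the field, for example `collection.create_index("random")`, lets the `$gte` lookup avoid a full collection scan.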

Then:

import random

# mongodb is assumed to be an existing pymongo database handle.
collection = mongodb["collection_name"]

rand = random.random()  # rand is a float in the range [0, 1)
random_record = collection.find_one({'random': {'$gte': rand}})

MongoDB should get a native implementation in due course. The feature request is filed here: https://jira.mongodb.org/browse/SERVER-533

Not yet implemented at time of writing.
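
Later MongoDB releases (3.2 and newer) do provide a native option, the `$sample` aggregation stage. A minimal PyMongo sketch, assuming `collection` is a pymongo collection on a server that supports `$sample`; the helper name is hypothetical.

```python
def random_document_by_sample(collection):
    """Return one pseudo-randomly chosen document, or None if the collection is empty.

    Hypothetical helper; `collection` is assumed to be a pymongo Collection.
    """
    # $sample asks the server to pick `size` documents at random.
    pipeline = [{"$sample": {"size": 1}}]
    return next(iter(collection.aggregate(pipeline)), None)
```

This needs neither an extra field nor an extra index, which avoids the main drawback of the approaches above.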

Calvin Cheng answered Oct 15 '22 12:10