Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Random sort order

The question about the way to get a random document from collection has been asked many times and there were suggestions on this topic.

What I need is to get several random documents from collection and what is even worse - those documents must match certain criterias (filtered, I mean). For example, I have a collection of articles where each article has a 'topic' field. The User chooses a topic he's interested in, and my db must show the corresponding articles each time in random order.

Obviously, previously discussed hacks won't help me. The only way to achieve what I want is to query for corresponding topic getting ids only:

var arr = db.articles.find({topic: 3}, {_id:1}).toArray();

and then generate random sequense of numbers depending on how many documents were received and then obtain document ids from the array using random numbers as indexes of that array and then finally do another request to mongodb to obtain documents with those randomly chosen ids.

As you can see, it seems a little bit too slow altogether, especially, if there are too many articles returned by first query:)

So what I think is that there may be some mongodb command to get documents by index keys based on their position in the index. The point is that I can create covered compound index like this:

db.articles.ensureIndex({topic: 1, _id:1});

And now my query would only have to scan the continuos line of right _ids in index. And if I could request the documents from the collection by those '_ids' positions, then I could do the whole thing in one request! Something like:

var cursor = db.articles.find({topic:3, $indexKeyPosition: {$in: myRandomSequence}});

Does anyone know about such features?

like image 761
Andrey Kon Avatar asked Dec 17 '12 08:12

Andrey Kon


People also ask

Can Excel randomize a list?

Just type the name and empty parenthesis and press Enter – and a random number between 0 and 1 is added to the cell. Now, copy it down to create a whole column of random numbers that you need to randomize the list. Combined with sorting, you are now able to shuffle cells in Excel.

What is a random number sequence?

As the term suggests, a random number is a number chosen by chance -- i.e., randomly, from a set of numbers. All the numbers in a specified distribution have equal probability of being chosen randomly.


2 Answers

Nowadays, you should be able to use the $sample aggregation function.

Example (untested):

db.articles.aggregate([
    { $match : { topic : 3 } },
    { $sample : { size: 3 } }
])

Note, however, that it may return the same document more than once.

like image 186
Claudiu Avatar answered Sep 20 '22 13:09

Claudiu


So what I think is that there may be some mongodb command to get documents by index keys based on their position in the index. The point is that I can create covered compound index like this:

No there isn't such a function in MongoDB, though it is a good idea to be able to randomise a result set. In the meantime here is a JIRA: https://jira.mongodb.org/browse/SERVER-533

Since there is no way to pick from a position of index so that it can use an index and consequently a single round trip you have no choice but you open multiple cursors.

The current solution depends on how many documents are in your result set.

If there are a small number of document in your result set you can solve this with an easy skip(rand()) and limit(1) however you must be aware that both skip() and limit() make no effective use of indexes.

That does not mean it will scan the entire Btree it means it will scan as far as you skip().

This means that if your result set grows large and the rand() becomes a large number you will see serious performance problems, as many have.

One good way of possibly solving this is to maintain either:

  • A rand() field value between 0 and 1
  • Or a incremental ID field ( http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field )

And use that new field to "skip" on using the rest of your query like:

var arr = db.articles.find({topic: 3, rand: rand()}, {_id:1}).limit(7).toArray();

Will get 7 random rows using the 0 to 1 idea.

The random sort abilities of this rely on an ever changing data set to help create a randomness within the sort. This will, of course, not work if the result set is continously static.

As for using batchSize, it becomes irrelavant here and infact normally does. For example your logic of using BatchSize to get all your results back does not make complete sense since the BatchSize normally has a absolute max size of 16MB. This means that if your documents are large you may not be getting the single round trip you think you are.

This also only dictates that the server will send all this data at once, it does not denote the amount of work placed on the server, just the amount of data sent over the wire at once.

So considering that you must do this with mutliple cursors (the way I would recommend) you can just run:

var arr = db.articles.find({topic: 3, rand: {$gte:rand()}}).sort({rand:1}).limit(1);

Several, or however many you need, times over. This is not much different to a normal iteration of a cursor and provided you have the right indexes should be quite fast.

There is one other method, but I do not recommend it. You can run an MR, say, once every hour or something which creates another collection of _id and rand() this means you can do the first query I showed:

var arr = db.articles.find({topic: 3, rand: rand()}, {_id:1}).limit(7).toArray();

And really get 7 random records, since the rand() will, of course, be different. But this is not real time and nor would it be very nice for your server on a large data set as such I do not recommend such a thing.

Edit

There is one other way. With the auto incrementing id you could do an $or statement to pick out 7 rand()s all at once. However this introduces yet another problem, deleting.

If you delete any rows you might hit a rand() that doesn't exist and so no row will be returned. Since the auto incrementing id is not maintained to counter deletes server-side you would have to do this yourself. This would not be an easy or scalable thing to do.

To add to this $or statements cannot be limit()ed on clause which means you can't get around this by doing sub select type $ors to make MongoDB only pick out one result per $or clause using $gte.

The same applies for a rand() between 0 and 1. This would work with an $or if you could limit clauses.

like image 20
Sammaye Avatar answered Sep 19 '22 13:09

Sammaye