I'm looking for an efficient way of randomly selecting 100 rows satisfying certain conditions from a MySQL table with potentially millions of rows.
Almost everything I've found suggests avoiding the use of ORDER BY RAND(), because of poor performance and scalability.
However, this article suggests ORDER BY RAND() may still be used as a "nice and fast way" to fetch randow data.
Based on this article, below is some example code showing what I'm trying to accomplish. My questions are:
Is this an efficient way of randomly selecting 100 (or up to several hundred) rows from a table with potentially millions of rows?
When will performance become an issue?
SELECT user.* FROM ( SELECT id FROM user WHERE is_active = 1 AND deleted = 0 AND expiretime > '.time().' AND id NOT IN (10, 13, 15) AND id NOT IN (20, 30, 50) AND id NOT IN (103, 140, 250) ORDER BY RAND() LIMIT 100 ) AS random_users STRAIGHT JOIN user ON user.id = random_users.id
Is strongly urge you to read this article. The last segment will be covering the selection of multiple random row. And you should be able to notice the SELECT
statement in the PROCEDURE
that will be described there. That would be the spot where you add your specific WHERE
conditions.
The problem with ORDER BY RAND()
is that this operation has complexity of n*log2(n)
, while the method described in the article that I linked, has almost constant complexity.
Lets assume, that selecting random row from table, which contains 10 entries, using ORDER BY RAND()
takes 1 time unit
:
entries | time units
-------------------------
10 | 1 /* if this takes 0.001s */
100 | 20
1'000 | 300
10'000 | 4'000
100'000 | 50'000
1'000'000 | 600'000 /* then this will need 10 minutes */
And you wrote that you are dealing with table on scale of millions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With