I would like to select a random row from a very large table (10 mil records). So the strategy that are most common such as RAND()
and NEWID()
doesn't seem to be practical.
I have tried the following strategy and would like to know if this is the most ideal way.
Create a new field called 'RandomSort' as UniqueIdentified
At the end of each hour/day will do a Update RandomSort = NewID()
to the entire table
Top 10 Order by RandomSort
It does get the job done (better than ORDER BY NewID
), but not sure if this is the best practice so far?
Add an identity column 'rowid' (int
or bigint
depending on your table size) and create a unique non-clustered index on it.
The following query uses the NEWID()
function to return approximately one percent of the rows of the table:
SELECT * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), rowID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
The rowId column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), rowid) & 0x7fffffff AS float / CAST(0x7fffffff AS int)
evaluates to a random float value between 0 and 1.
In fact you could use any column indexed column in your table (I believe).
If you just want to pick a single random row:
SELECT TOP 1 * FROM table
WHERE rowid >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(rowid) FROM table)
This works in constant time, provided the rowid
column is indexed. Note: this assumes that rowid
is uniformly distributed in the range 0..MAX(rowid)
, hence the suggested identity column addition. If your dataset has some other distribution, your results will be skewed (i.e. some rows will be picked more often than others).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With