Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Large Table select random row strategy

I would like to select a random row from a very large table (10 mil records). So the strategy that are most common such as RAND() and NEWID() doesn't seem to be practical.

I have tried the following strategy and would like to know if this is the most ideal way.

  1. Create a new field called 'RandomSort' as UniqueIdentified

  2. At the end of each hour/day will do a Update RandomSort = NewID() to the entire table

  3. Each time I need to query, I can do a Top 10 Order by RandomSort

It does get the job done (better than ORDER BY NewID), but not sure if this is the best practice so far?

like image 303
Lasker Avatar asked May 26 '15 07:05

Lasker


1 Answers

Add an identity column 'rowid' (int or bigint depending on your table size) and create a unique non-clustered index on it.

The following query uses the NEWID() function to return approximately one percent of the rows of the table:

SELECT * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), rowID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

The rowId column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), rowid) & 0x7fffffff AS float / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.

In fact you could use any column indexed column in your table (I believe).

If you just want to pick a single random row:

SELECT TOP 1 * FROM table 
WHERE rowid >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(rowid) FROM table) 

This works in constant time, provided the rowid column is indexed. Note: this assumes that rowid is uniformly distributed in the range 0..MAX(rowid), hence the suggested identity column addition. If your dataset has some other distribution, your results will be skewed (i.e. some rows will be picked more often than others).

like image 90
Mitch Wheat Avatar answered Oct 13 '22 21:10

Mitch Wheat