Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating random values from uniform distribution with setting a seed in T-SQL

I would like to generate a random value from the uniform distribution with mean=0 and a standard devation=1 for every row of a given data table in T-SQL. Additionally, I would like to set a seed in order to ensure reproducibility of analysis. Here are the ideas which did not work:

  1. Using the function RAND() with a declared number does not fulfill this objective: the same random value is generated for every row of the data set.

  2. Such a solution:

    SELECT ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS [RandomNumber]

does not solve the problem either since it is not reproducible.

EDIT:

The performance does matter since my table has hundreds of millions of records.

like image 961
kaksat Avatar asked Mar 01 '17 22:03

kaksat


People also ask

How do you generate a random number from a uniform distribution?

The inversion method relies on the principle that continuous cumulative distribution functions (cdfs) range uniformly over the open interval (0,1). If u is a uniform random number on (0,1), then x = F - 1 ( u ) generates a random number x from any continuous distribution with the specified cdf F .

How do you generate random values in SQL?

SQL Server RAND() Function The RAND() function returns a random number between 0 (inclusive) and 1 (exclusive).

How do I generate a random 10 digit number in SQL?

There are several methods to generate a random number. One simple method is to make use of RAND() function available in SQL Server. RAND() function simply generate a decimal number between 0 (inclusive) and 1 (exclusive).


4 Answers

The main question here IMHO is how you see 'repeatabililty'? Or asked differently: what 'drives' the randomness? I can envision a solution that sticks the same random number to every record for each run as long as the data doesn't change. However, what do you expect to happen if the data changes?

For the fun of it, I did the following tests on a (not very representative) test-table with 1 million of rows:

-- seed
SELECT Rand(0)

-- will show the same random number for EVERY record
SELECT Number, blah = Convert(varchar(100), NewID()), random = Rand()
  INTO #test
  FROM master.dbo.fn_int_list(1, 1000000)

CREATE UNIQUE CLUSTERED INDEX uq0_test ON #test (Number)

SET NOCOUNT ON

GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP,
        @c_number int

-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !

-- seed
SELECT @c_number = Rand(0) 

-- update 1 by 1 
DECLARE cursor_no_transaction CURSOR LOCAL STATIC
    FOR SELECT Number
          FROM #test
         ORDER BY Number
OPEN cursor_no_transaction 
FETCH NEXT FROM cursor_no_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE #test 
           SET random = Rand()
         WHERE Number = @c_number

        FETCH NEXT FROM cursor_no_transaction INTO @c_number
    END
CLOSE cursor_no_transaction 
DEALLOCATE cursor_no_transaction 

PRINT 'Time needed (no transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP,
        @c_number int

BEGIN TRANSACTION

-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !

-- seed
SELECT @c_number = Rand(0) 

-- update 1 by 1 but all of it inside 1 single transaction
DECLARE cursor_single_transaction CURSOR LOCAL STATIC
    FOR SELECT Number
          FROM #test
         ORDER BY Number
OPEN cursor_single_transaction 
FETCH NEXT FROM cursor_single_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE #test 
           SET random = Rand()
         WHERE Number = @c_number

        FETCH NEXT FROM cursor_single_transaction INTO @c_number
    END
CLOSE cursor_single_transaction 
DEALLOCATE cursor_single_transaction 

COMMIT TRANSACTION

PRINT 'Time needed (single transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP

-- update each record (single operation), use the Number column to reseed the Rand() function for every record
UPDATE #test 
    SET random = Rand(Number)

PRINT 'Time needed Rand(Number) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP

-- update each record (single operation), use 'a bunch of fields' to reseed the Rand() function for every record
UPDATE #test 
    SET random = Rand(BINARY_CHECKSUM(Number, blah))

PRINT 'Time needed Rand(BINARY_CHECKSUM(Number, blah)) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

The outcomes are more or less as expected:

Time needed (no transaction) : 24570 ms.
_avg                   _stdev
---------------------- ----------------------
0.499630943538644      0.288686960086461

Time needed (single transaction) : 14813 ms.
_avg                   _stdev
---------------------- ----------------------
0.499630943538646      0.288686960086461

Time needed Rand(Number) : 1203 ms.
_avg                   _stdev
---------------------- ----------------------
0.499407423620328      0.291093824839539

Time needed Rand(BINARY_CHECKSUM(Number, blah)) : 1250 ms.
_avg                   _stdev
---------------------- ----------------------
0.499715398881586      0.288579510523627

All of these are 'repeatable', question is if 'repeatable' means what you want it to mean here. I've stuck to AVG() and STDEV() to get a coarse idea of the distribution, I'll leave it up to you to see if they actually fit the bill (and if not, how to improve upon it =)

1.2 seconds for 1 million rows doesn't sound too bad for 1 million of rows IMHO. That said, if your table contains extra columns it will take up more space and hence take more time!

Hope this gets you started...

like image 129
deroby Avatar answered Nov 05 '22 10:11

deroby


The Rand() function can be seeded at the start by passing it an integer seed value. If you do this once before generating any random numbers, the sequence of random numbers will be repeatable. Generating the values individually will ensure the Rand() function returns the numbers in sequence. The following will produce a uniform distribution of n pseudo-random numbers with mean=0 and standard deviation=1:

    DECLARE @Mean    FLOAT = 0.0; 
    DECLARE @stDev   FLOAT = 1.0; 
    DECLARE @n   INT = 100000;   -- count of random numbers to generate
    DECLARE @U   TABLE(x FLOAT); -- table of random numbers

    DECLARE @SEED    INT = 123456;    -- seed to ensure list is reproducible
    SELECT RAND(@Seed);

    SET NOCOUNT ON;
    BEGIN TRAN
    DECLARE @x INT = 0; -- counter
    WHILE @x < @n
      BEGIN
      INSERT INTO @U (x)
        SELECT @Mean + (2 * SQRT(3) * @stDev) * (RAND() - 0.5)
      SET @x = @x + 1;
      END;
    COMMIT

-- Check the results    
    SELECT * from @U;

    SELECT AVG([@U].x) AS mean,
        STDEV([@U].x) AS stDev
        FROM @U;

Instead of inserting into a temporary table in a while loop, you could loop through the records in your existing table using a cursor and do an Update on each record. As mentioned in the comments, performance could be an issue, but it meets the requirements "uniform distribution with mean=0 and a standard devation=1" and "reproducibility". The way the Rand() function works forces the "1 by 1" update.

Below is an alternative that will have much better performance (should run in under 2 seconds with 1 million rows) with a replacement for the Rand() function. This allows the records to be updated in a single UPDATE but relies on a unique numeric ID field in your table and updates a field called RandomNumber. The Rand() function is replaced by ( (ID * @SEED ) % 1000 ) / 1000 which can probably be improved upon.

DECLARE @Mean    FLOAT = 0.0; 
DECLARE @stDev   FLOAT = 1.0; 
DECLARE @SEED numeric(18,0)    = 1234567890.0;    -- seed to ensure list is reproducible

SET NOCOUNT ON;
BEGIN TRAN
UPDATE TestTable
   set Randomnumber = @Mean + (2 * SQRT(3) * @stDev) * (( (ID * @SEED ) % 1000 ) / 1000 - 0.5) 
COMMIT
-- Check the results    
SELECT AVG(RandomNumber) AS mean,
    STDEV(RandomNumber ) AS stDev
    FROM TestTable;
like image 31
chrisuae Avatar answered Nov 05 '22 11:11

chrisuae


DECLARE @userReportId BIGINT
SET @userReportId = FLOOR(RAND()*(10000000000000-1) + 1);
like image 22
Spythere Avatar answered Nov 05 '22 10:11

Spythere


Repeatable random numbers are - quite probably - needed to repeat a situation, where tests went wrong in order to reproduce the circumstances of an exception.

The following suggestion will fill a physical table (add indexes!) with a position and a random number.

Use this list with a simple join to connect each of your rows with a random number.

Every call will bind the same random number to a given row.

Changing this can be done by re-positioning the randoms with a new random position (or you truncate-refill or drop-recreate the table).

This should be pretty fast...

CREATE TABLE dbo.MyRepeatableRandoms(CurrentPosition BIGINT,RandomNumber BIGINT);
GO
DECLARE @CountOfNumbers INT=5; --set a fitting max count here
WITH Tally AS
(
 SELECT TOP(@CountOfNumbers) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nr 
 FROM master..spt_values 
      CROSS JOIN master..spt_values X 
      CROSS JOIN master..spt_values Y
)
INSERT INTO dbo.MyRepeatableRandoms
SELECT Nr,CAST(CAST(NEWID() AS VARBINARY(8)) AS BIGINT) FROM Tally;

--Use this list with a simple join to bind it to the rows of your table
SELECT * FROM dbo.MyRepeatableRandoms ORDER BY CurrentPosition;

--Re-Position the list
WITH UpdateableCTE AS
(
 SELECT ROW_NUMBER() OVER(ORDER BY A.posOrder) AS NewPos
       ,CurrentPosition
 FROM dbo.MyRepeatableRandoms
 CROSS APPLY(SELECT NEWID() AS posOrder) AS A
)
UPDATE UpdateableCTE SET CurrentPosition=NewPos;

--The same random numbers at new positions
SELECT * FROM MyRepeatableRandoms ORDER BY CurrentPosition;

GO
DROP TABLE dbo.MyRepeatableRandoms

The result

    RandomNumber
1   -1939965404062448822
2   2786711671511266125
3   -3236707863137400753
4   -6029509773149087675
5   7815987559555455297

After re-positioning

    RandomNumber
1   7815987559555455297
2   -1939965404062448822
3   2786711671511266125
4   -6029509773149087675
5   -3236707863137400753
like image 40
Shnugo Avatar answered Nov 05 '22 09:11

Shnugo