I would like to generate a random value from the uniform distribution with mean=0 and a standard devation=1 for every row of a given data table in T-SQL. Additionally, I would like to set a seed in order to ensure reproducibility of analysis. Here are the ideas which did not work:
Using the function RAND()
with a declared number does not fulfill this objective: the same random value is generated for every row of the data set.
Such a solution:
SELECT ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS [RandomNumber]
does not solve the problem either since it is not reproducible.
EDIT:
The performance does matter since my table has hundreds of millions of records.
The inversion method relies on the principle that continuous cumulative distribution functions (cdfs) range uniformly over the open interval (0,1). If u is a uniform random number on (0,1), then x = F - 1 ( u ) generates a random number x from any continuous distribution with the specified cdf F .
SQL Server RAND() Function The RAND() function returns a random number between 0 (inclusive) and 1 (exclusive).
There are several methods to generate a random number. One simple method is to make use of RAND() function available in SQL Server. RAND() function simply generate a decimal number between 0 (inclusive) and 1 (exclusive).
The main question here IMHO is how you see 'repeatabililty'? Or asked differently: what 'drives' the randomness? I can envision a solution that sticks the same random number to every record for each run as long as the data doesn't change. However, what do you expect to happen if the data changes?
For the fun of it, I did the following tests on a (not very representative) test-table with 1 million of rows:
-- seed
SELECT Rand(0)
-- will show the same random number for EVERY record
SELECT Number, blah = Convert(varchar(100), NewID()), random = Rand()
INTO #test
FROM master.dbo.fn_int_list(1, 1000000)
CREATE UNIQUE CLUSTERED INDEX uq0_test ON #test (Number)
SET NOCOUNT ON
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP,
@c_number int
-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !
-- seed
SELECT @c_number = Rand(0)
-- update 1 by 1
DECLARE cursor_no_transaction CURSOR LOCAL STATIC
FOR SELECT Number
FROM #test
ORDER BY Number
OPEN cursor_no_transaction
FETCH NEXT FROM cursor_no_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE #test
SET random = Rand()
WHERE Number = @c_number
FETCH NEXT FROM cursor_no_transaction INTO @c_number
END
CLOSE cursor_no_transaction
DEALLOCATE cursor_no_transaction
PRINT 'Time needed (no transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP,
@c_number int
BEGIN TRANSACTION
-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !
-- seed
SELECT @c_number = Rand(0)
-- update 1 by 1 but all of it inside 1 single transaction
DECLARE cursor_single_transaction CURSOR LOCAL STATIC
FOR SELECT Number
FROM #test
ORDER BY Number
OPEN cursor_single_transaction
FETCH NEXT FROM cursor_single_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE #test
SET random = Rand()
WHERE Number = @c_number
FETCH NEXT FROM cursor_single_transaction INTO @c_number
END
CLOSE cursor_single_transaction
DEALLOCATE cursor_single_transaction
COMMIT TRANSACTION
PRINT 'Time needed (single transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP
-- update each record (single operation), use the Number column to reseed the Rand() function for every record
UPDATE #test
SET random = Rand(Number)
PRINT 'Time needed Rand(Number) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP
-- update each record (single operation), use 'a bunch of fields' to reseed the Rand() function for every record
UPDATE #test
SET random = Rand(BINARY_CHECKSUM(Number, blah))
PRINT 'Time needed Rand(BINARY_CHECKSUM(Number, blah)) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
The outcomes are more or less as expected:
Time needed (no transaction) : 24570 ms.
_avg _stdev
---------------------- ----------------------
0.499630943538644 0.288686960086461
Time needed (single transaction) : 14813 ms.
_avg _stdev
---------------------- ----------------------
0.499630943538646 0.288686960086461
Time needed Rand(Number) : 1203 ms.
_avg _stdev
---------------------- ----------------------
0.499407423620328 0.291093824839539
Time needed Rand(BINARY_CHECKSUM(Number, blah)) : 1250 ms.
_avg _stdev
---------------------- ----------------------
0.499715398881586 0.288579510523627
All of these are 'repeatable', question is if 'repeatable' means what you want it to mean here. I've stuck to AVG() and STDEV() to get a coarse idea of the distribution, I'll leave it up to you to see if they actually fit the bill (and if not, how to improve upon it =)
1.2 seconds for 1 million rows doesn't sound too bad for 1 million of rows IMHO. That said, if your table contains extra columns it will take up more space and hence take more time!
Hope this gets you started...
The Rand() function can be seeded at the start by passing it an integer seed value. If you do this once before generating any random numbers, the sequence of random numbers will be repeatable. Generating the values individually will ensure the Rand() function returns the numbers in sequence. The following will produce a uniform distribution of n pseudo-random numbers with mean=0 and standard deviation=1:
DECLARE @Mean FLOAT = 0.0;
DECLARE @stDev FLOAT = 1.0;
DECLARE @n INT = 100000; -- count of random numbers to generate
DECLARE @U TABLE(x FLOAT); -- table of random numbers
DECLARE @SEED INT = 123456; -- seed to ensure list is reproducible
SELECT RAND(@Seed);
SET NOCOUNT ON;
BEGIN TRAN
DECLARE @x INT = 0; -- counter
WHILE @x < @n
BEGIN
INSERT INTO @U (x)
SELECT @Mean + (2 * SQRT(3) * @stDev) * (RAND() - 0.5)
SET @x = @x + 1;
END;
COMMIT
-- Check the results
SELECT * from @U;
SELECT AVG([@U].x) AS mean,
STDEV([@U].x) AS stDev
FROM @U;
Instead of inserting into a temporary table in a while loop, you could loop through the records in your existing table using a cursor and do an Update on each record. As mentioned in the comments, performance could be an issue, but it meets the requirements "uniform distribution with mean=0 and a standard devation=1" and "reproducibility". The way the Rand() function works forces the "1 by 1" update.
Below is an alternative that will have much better performance (should run in under 2 seconds with 1 million rows) with a replacement for the Rand() function. This allows the records to be updated in a single UPDATE
but relies on a unique numeric ID
field in your table and updates a field called RandomNumber
. The Rand() function is replaced by ( (ID * @SEED ) % 1000 ) / 1000
which can probably be improved upon.
DECLARE @Mean FLOAT = 0.0;
DECLARE @stDev FLOAT = 1.0;
DECLARE @SEED numeric(18,0) = 1234567890.0; -- seed to ensure list is reproducible
SET NOCOUNT ON;
BEGIN TRAN
UPDATE TestTable
set Randomnumber = @Mean + (2 * SQRT(3) * @stDev) * (( (ID * @SEED ) % 1000 ) / 1000 - 0.5)
COMMIT
-- Check the results
SELECT AVG(RandomNumber) AS mean,
STDEV(RandomNumber ) AS stDev
FROM TestTable;
DECLARE @userReportId BIGINT
SET @userReportId = FLOOR(RAND()*(10000000000000-1) + 1);
Repeatable random numbers are - quite probably - needed to repeat a situation, where tests went wrong in order to reproduce the circumstances of an exception.
The following suggestion will fill a physical table (add indexes!) with a position and a random number.
Use this list with a simple join to connect each of your rows with a random number.
Every call will bind the same random number to a given row.
Changing this can be done by re-positioning the randoms with a new random position (or you truncate-refill or drop-recreate the table).
This should be pretty fast...
CREATE TABLE dbo.MyRepeatableRandoms(CurrentPosition BIGINT,RandomNumber BIGINT);
GO
DECLARE @CountOfNumbers INT=5; --set a fitting max count here
WITH Tally AS
(
SELECT TOP(@CountOfNumbers) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nr
FROM master..spt_values
CROSS JOIN master..spt_values X
CROSS JOIN master..spt_values Y
)
INSERT INTO dbo.MyRepeatableRandoms
SELECT Nr,CAST(CAST(NEWID() AS VARBINARY(8)) AS BIGINT) FROM Tally;
--Use this list with a simple join to bind it to the rows of your table
SELECT * FROM dbo.MyRepeatableRandoms ORDER BY CurrentPosition;
--Re-Position the list
WITH UpdateableCTE AS
(
SELECT ROW_NUMBER() OVER(ORDER BY A.posOrder) AS NewPos
,CurrentPosition
FROM dbo.MyRepeatableRandoms
CROSS APPLY(SELECT NEWID() AS posOrder) AS A
)
UPDATE UpdateableCTE SET CurrentPosition=NewPos;
--The same random numbers at new positions
SELECT * FROM MyRepeatableRandoms ORDER BY CurrentPosition;
GO
DROP TABLE dbo.MyRepeatableRandoms
The result
RandomNumber
1 -1939965404062448822
2 2786711671511266125
3 -3236707863137400753
4 -6029509773149087675
5 7815987559555455297
After re-positioning
RandomNumber
1 7815987559555455297
2 -1939965404062448822
3 2786711671511266125
4 -6029509773149087675
5 -3236707863137400753
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With