Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Distinct random time generation in the fixed interval

I'm trying to generate a random time between 8:00 AM and 8:00 PM for each row that is selected from a data set, however, I always get the same random value for each row – I want it to be different for each row.

Table schema & data:

╔══════╦════════════════╗
║  ID  ║  CREATED_DATE  ║
╠══════╬════════════════╣
║ ID/1 ║   26/04/2014   ║
║ ID/2 ║   26/04/2014   ║
║ ID/3 ║   26/04/2014   ║
║ ID/4 ║   26/04/2014   ║
║ ID/5 ║   26/04/2014   ║
╚══════╩════════════════╝

Сurrent SQL statement:

SELECT [ID]
     , MyFunction.dbo.AddWorkDays(14, [CREATED_DATE]) AS [New Date]
     , CONVERT(VARCHAR, DATEADD(MILLISECOND, CAST(43200000 * RAND() AS INT), CONVERT(TIME, '08:00')), 114) AS [New Time]
FROM [RandomTable]

Current results (same time for each row in the [New Time] column):

╔══════╦════════════════╦════════════════╗
║  ID  ║    New Date    ║    New Time    ║
╠══════╬════════════════╬════════════════╣
║ ID/1 ║   10/05/2014   ║    09:41:43    ║
║ ID/2 ║   10/05/2014   ║    09:41:43    ║
║ ID/3 ║   10/05/2014   ║    09:41:43    ║
║ ID/4 ║   10/05/2014   ║    09:41:43    ║
║ ID/5 ║   10/05/2014   ║    09:41:43    ║
╚══════╩════════════════╩════════════════╝

Desired results (different time for each row in the [New Time] column):

╔══════╦════════════════╦════════════════╗
║  ID  ║    New Date    ║    New Time    ║
╠══════╬════════════════╬════════════════╣
║ ID/1 ║   10/05/2014   ║    09:41:43    ║
║ ID/2 ║   10/05/2014   ║    15:05:23    ║
║ ID/3 ║   10/05/2014   ║    10:01:05    ║
║ ID/4 ║   10/05/2014   ║    19:32:45    ║
║ ID/5 ║   10/05/2014   ║    08:43:15    ║
╚══════╩════════════════╩════════════════╝

Any ideas on how to fix this? All of the above is just sample data – my real table has around 2800 records (not sure if that will make a difference to anyone's suggestions).

like image 800
AMC Avatar asked Apr 26 '14 17:04

AMC


4 Answers

The issue OP had while using just rand() is due to it's evaluation once per query.

From the documentation:

If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same.

Approach that is described below removes optimization and suppresses this behavior, so rand() is evaluated once per row:

dateadd( second
       , rand(cast(newid() as varbinary)) * 43200
       , cast('08:00:00' as time) )
  • newid() generates unique value of type uniqueidentifier;
  • the value is converted with cast to be used as seed in rand([seed]) function to generate a pseudo-random float value from 0 through 1, and as seed is always unique the returning value is unique too .

SQLFiddle

like image 100
potashin Avatar answered Nov 09 '22 05:11

potashin


You could alternatively use:

SELECT DATEADD(s, ABS(CHECKSUM(NewId()) % 43201), CAST('08:00:00' AS Time))

The ABS(CHECKSUM(NewId()) % 43201) generates a random number between 0 and 43200. See Discussion here.

SQL Fiddle

MS SQL Server 2008 Schema Setup:

Query 1:

SELECT DATEADD(s, ABS(CHECKSUM(NewId()) % 43201), CAST('08:00:00' AS Time)) AS [RandomTime]
FROM 
( VALUES (1), (2), (3), (4), (5)
) Y(A)
CROSS JOIN
( VALUES (1), (2), (3), (4), (5)
) Z(A)

Results:

|    RANDOMTIME    |
|------------------|
| 16:51:58.0000000 |
| 10:42:44.0000000 |
| 14:01:38.0000000 |
| 13:33:51.0000000 |
| 18:00:51.0000000 |
| 11:29:03.0000000 |
| 10:21:14.0000000 |
| 16:38:27.0000000 |
| 09:55:37.0000000 |
| 13:21:13.0000000 |
| 11:29:37.0000000 |
| 10:57:49.0000000 |
| 14:56:42.0000000 |
| 15:33:11.0000000 |
| 18:49:45.0000000 |
| 16:23:28.0000000 |
| 09:00:05.0000000 |
| 09:20:01.0000000 |
| 11:26:23.0000000 |
| 15:26:23.0000000 |
| 10:38:44.0000000 |
| 11:46:30.0000000 |
| 16:00:59.0000000 |
| 09:29:18.0000000 |
| 09:09:19.0000000 |
like image 43
Steve Ford Avatar answered Nov 09 '22 05:11

Steve Ford


Interpretation of Original Question:

The question states:

  • Generate a random time between 8:00 AM and 8:00 PM (i.e. a 12-hour window)
  • It should be different for each row (i.e. unique across all rows)
  • The real table has around 2800 records

Now factor in the following points:

  • Sample data shows only a single date
  • There are 86,400 seconds in 24 hours, hence 43,200 seconds in 12 hours

There is some ambiguity in the following areas:

  • What exactly is random within the context of "different for every row", given that truly random values cannot be guaranteed to be different for every row. In fact, truly random numbers could theoretically be the same for every row. So is the emphasis on "random" or "different"? Or are we really talking about different but not sequentially ordered (to give the appearance of randomness without actually being random)?
  • What if there are ever more than 2800 rows? What if there are 1 million rows?
  • If there can be more than 43,200 rows, how to handle "different for each row" (since it is not possible to have unique across all rows)?
  • Will the date ever vary? If so, are we really talking about "different for each row per date"?
  • If "different for each row per date":
    • Can the times for each date follow the same, non-sequential pattern? Or does the pattern need to differ per each date?
    • Will there ever be more than 43,200 rows for any particular date? If so, the times can only be unique per each set of 43,200 rows.

Given the information above, there are a few ways to interpret the request:

  1. Emphasis on "random": Dates and number of rows don't matter. Generate truly random times that are highly likely, but not guaranteed, to be unique using one of the three methods shown in the other answers:
    • @notulysses: RAND(CAST(NEWID() AS VARBINARY)) * 43200
    • @Steve Ford: ABS(CHECKSUM(NewId()) % 43201)
    • @Vladimir Baranov : CAST(43200000 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) as int)
  2. Emphasis on "different for each row", always <= 43,200 rows: If the number of rows never exceeds the number of available seconds, it is easy to guarantee unique times across all rows, regardless of same or different dates, and appear to be randomly ordered.
  3. Emphasis on "different for each row", could be > 43,200 rows: If the number of rows can exceed the number of available seconds, then it is not possible to guarantee uniqueness across all rows, but it would be possible to still guarantee uniqueness across rows of any particular date, provided that no particular date has > 43,200 rows.

Hence, I based my answer on the idea that:

  • Even if the number of rows for the O.P. never exceeds 2800, it is more likely that most others who are encountering a similar need for randomness would have a larger data set to work with (i.e. there could easily be 1 million rows, for any number of dates: 1, 5000, etc.)
  • Either the sample data is overly simplistic in using the same date for all 5 rows, or even if the date is the same for all rows in this particular case, in most other cases that is less likely to happen
  • Uniqueness is to be favored over Randomness
  • If there is a pattern to the "seemingly random" ordering of the seconds for each date, there should at least be a varying offset to the start of the sequence across the dates (when the dates are ordered sequentially) to give the appearance of randomness between any small grouping of dates.

Answer:

If the situation requires unique times, that cannot be guaranteed with any method of generating truly random values. I really like the use of CRYPT_GEN_RANDOM by @Vladimir Baranov, but it is nearly impossible to get a unique set of values generated:

DECLARE @Table TABLE (Col1 BIGINT NOT NULL UNIQUE);

INSERT INTO @Table (Col1)
    SELECT CONVERT(BIGINT, CRYPT_GEN_RANDOM(4))
    FROM [master].sys.objects so
    CROSS JOIN [master].sys.objects so2
    CROSS JOIN [master].sys.objects so3;
    -- 753,571 rows

Increasing the random value to 8 bytes does seem to work:

DECLARE @Table TABLE (Col1 BIGINT NOT NULL UNIQUE);

INSERT INTO @Table (Col1)
    SELECT CONVERT(BIGINT, CRYPT_GEN_RANDOM(8))
    FROM [master].sys.objects so
    CROSS JOIN [master].sys.objects so2
    CROSS JOIN [master].sys.objects so3;
    -- 753,571 rows

Of course, if we are generating down to the second, then there are only 86,400 of those. Reducing the scope seems to help as the following does occasionally work:

DECLARE @Table TABLE (Col1 BIGINT NOT NULL UNIQUE);

INSERT INTO @Table (Col1)
    SELECT TOP (86400) CONVERT(BIGINT, CRYPT_GEN_RANDOM(4))
    FROM [master].sys.objects so
    CROSS JOIN [master].sys.objects so2
    CROSS JOIN [master].sys.objects so3;

However, things get a bit trickier if the uniqueness needs per each day (which seems like a reasonable requirement of this type of project, as opposed to unique across all days). But a random number generator isn't going to know to reset at each new day.

If it is acceptable to merely have the appearance of being random, then we can guarantee uniqueness per each date without:

  • looping / cursor constructs
  • saving already used values in a table
  • using RAND(), NEWID(), or CRYPT_GEN_RANDOM()

The following solution uses the concept of Modular Multiplicative Inverses (MMI) which I learned about in this answer: generate seemingly random unique numeric ID in SQL Server . Of course, that question did not have a tightly-defined range of values like we have here with only 86,400 of them per day. So, I used a range of 86400 (as "Modulo") and tried a few "coprime" values (as "Integer") in an online calculator to get their MMIs:

  • 13 (MMI = 39877)
  • 37 (MMI = 51373)
  • 59 (MMI = 39539)

I use ROW_NUMBER() in a CTE, partitioned (i.e. grouped) by CREATED_DATE as a means of assigning each second of the day a value.

But, while the values generated for seconds 0, 1, 2, ... and so on sequentially will appear random, across different days that particular second will map to the same value. So, the second CTE (named "WhichSecond") shifts the starting point for each date by converting the date to an INT (which converts dates to a sequential offset from 1900-01-01) and then multiply by 101.

DECLARE @Data TABLE
(
  ID INT NOT NULL IDENTITY(1, 1),
  CREATED_DATE DATE NOT NULL
);

INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2015-03-15');
INSERT INTO @Data (CREATED_DATE) VALUES ('2016-10-22');
INSERT INTO @Data (CREATED_DATE) VALUES ('2015-03-15');

;WITH cte AS
(
  SELECT tmp.ID,
         CONVERT(DATETIME, tmp.CREATED_DATE) AS [CREATED_DATE],
         ROW_NUMBER() OVER (PARTITION BY tmp.CREATED_DATE ORDER BY (SELECT NULL))
                      AS [RowNum]
  FROM   @Data tmp
), WhichSecond AS
(
  SELECT cte.ID,
         cte.CREATED_DATE,
         ((CONVERT(INT, cte.[CREATED_DATE]) - 29219) * 101) + cte.[RowNum]
                      AS [ThisSecond]
  FROM   cte
)
SELECT parts.*,
       (parts.ThisSecond % 86400) AS [NormalizedSecond], -- wrap around to 0 when
                                                         -- value goes above 86,400
       ((parts.ThisSecond % 86400) * 39539) % 86400 AS [ActualSecond],
       DATEADD(
                 SECOND,
                 (((parts.ThisSecond % 86400) * 39539) % 86400),
                 parts.CREATED_DATE
              ) AS [DateWithUniqueTime]
FROM WhichSecond parts
ORDER BY parts.ID;

Returns:

ID  CREATED_DATE  ThisSecond  NormalizedSecond  ActualSecond  DateWithUniqueTime
1   2014-10-05    1282297     72697             11483         2014-10-05 03:11:23.000
2   2014-10-05    1282298     72698             51022         2014-10-05 14:10:22.000
3   2014-10-05    1282299     72699              4161         2014-10-05 01:09:21.000
4   2014-10-05    1282300     72700             43700         2014-10-05 12:08:20.000
5   2014-10-05    1282301     72701             83239         2014-10-05 23:07:19.000
6   2015-03-15    1298558      2558             52762         2015-03-15 14:39:22.000
7   2016-10-22    1357845     61845             83055         2016-10-22 23:04:15.000
8   2015-03-15    1298559      2559              5901         2015-03-15 01:38:21.000

If we want to only generate times between 8:00 AM and 8:00 PM, we only need to make a few minor adjustments:

  1. Change the range (as "Modulo") from 86400 to half of it: 43200
  2. Recalculate the MMI (can use the same "coprime" values as "Integer"): 39539 (same as before)
  3. Add 28800 to the second parameter of the DATEADD as an 8 hour offset

The result will be a change to just one line (since the others are diagnostic):

-- second parameter of the DATEADD() call
28800 + (((parts.ThisSecond % 43200) * 39539) % 43200)

Another means of shifting each day in a less predictable fashion would be to make use of RAND() by passing in the INT form of CREATED_DATE in the "WhichSecond" CTE. This would give a stable offset per each date since RAND(x) will return the same value y for the same value of x passed in, but will return a different value y for a different value of x passed in. Meaning:

RAND(1) = y1
RAND(2) = y2
RAND(3) = y3
RAND(2) = y2

The second time RAND(2) was called, it still returned the same value of y2 that it returned the first time it was called.

Hence, the "WhichSecond" CTE could be:

(
  SELECT cte.ID,
         cte.CREATED_DATE,
         (RAND(CONVERT(INT, cte.[CREATED_DATE])) * {some number}) + cte.[RowNum]
                      AS [ThisSecond]
  FROM   cte
)
like image 15
Solomon Rutzky Avatar answered Nov 09 '22 05:11

Solomon Rutzky


There are several methods:

  • Generate a table with random numbers in advance and use it whenever needed. Or take this data from some reputable source.
  • Various combinations that use NEWID function to provide a seed for RAND. It should be used with caution, because there is no guarantee about distribution of NEWID values. One of the best methods to make it more or less uniformly distributed is via the CHECKSUM: RAND(CHECKSUM(NEWID())). The good thing about this method is that NEWID function is available since SQL Server 2000.
  • Instead of NEWID use, say, MD5 of some column as a seed for RAND: RAND(CHECKSUM(HASHBYTES('MD5', CAST(SomeID AS varbinary(4))))) Or simply row number: RAND(CHECKSUM(HASHBYTES('MD5', CAST(ROW_NUMBER() OVER(ORDER BY ...) AS varbinary(4))))). This method is available since at least SQL Server 2005. The primary difference from NEWID method is that you have full control over the random sequence. You can't control what NEWID returns and you can't restart the random sequence from the same number again. If you supply same sets of, say, row numbers using PARTITION BY you'll get same sets of random numbers. It may be useful in the cases when you need to use the same sequence of random numbers several times. It is possible to get the same random number for two different seeds. I tested it for row numbers from 1 to 1,000,000. MD5 of them are all different. CHECKSUM of MD5 result in 122 collisions. RAND of this CHECKSUM result in 246 collisions. When tested with row numbers from 1 to 100,000 CHECKSUM had 1 collision, RAND had 3 collisions.
  • Another possibility is to simply implement your own user-defined function in T-SQL that generates a random number using your preferred algorithm. In this case you have full control of everything. Usually pseudo random generators have to store their internal state between invocations, so you may end up with having a dedicated table that stores this data.
  • You can write your user-defined function using CLR. In this case you can implement your own generator, or use functions built-into .NET, like Random class, or RNGCryptoServiceProvider class.
  • At last, since SQL Server 2008 there is a built-in function CRYPT_GEN_RANDOM.

I will describe the last method in detail, because I think that it is a very good solution for SQL Server 2008 and above. CRYPT_GEN_RANDOM is called for each row of the result set, as opposed to RAND, which is called only once.

CRYPT_GEN_RANDOM (Transact-SQL)

Returns a cryptographic random number generated by the Crypto API (CAPI). The output is a hexadecimal number of the specified number of bytes.

Besides, CRYPT_GEN_RANDOM should provide much better random values, than RAND. Better in terms of distribution and crypto-strength. Example:

(CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5)

This generates 4 random bytes as varbinary. We have to explicitly cast them to int first. Then result is transformed into a float number between 0 and 1.

So, the original query would like this:

SELECT ID AS [ID]
     , MyFunction.dbo.AddWorkDays(14, S.CREATED_DATE) AS [New Date]
     , CONVERT(VARCHAR, DATEADD(MILLISECOND, 
     CAST(43200000 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) as int),
     CONVERT(TIME, '08:00')), 114) AS [New Time]
FROM RandomTable

Here is a stand-alone example that is easy to copy-paste and try (I used the query from another answer by @Steve Ford):

SELECT DATEADD(millisecond, 
    CAST(43200000 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) as int), 
    CAST('08:00:00' AS Time)) AS [RandomTime]
FROM 
    ( VALUES (1), (2), (3), (4), (5)
    ) Y(A)
    CROSS JOIN
    ( VALUES (1), (2), (3), (4), (5)
    ) Z(A)

This is the result:

RandomTime
10:58:24.7200000
19:40:06.7220000
11:04:29.0530000
08:57:31.6130000
15:03:14.9470000
09:15:34.9380000
13:46:43.1250000
11:27:00.8940000
14:42:23.6100000
15:07:56.2120000
11:39:09.8830000
08:16:44.3960000
14:23:38.4820000
17:28:31.7440000
16:29:31.4320000
09:09:15.0210000
12:31:09.8370000
11:23:09.8430000
15:35:45.5480000
17:42:49.3390000
08:07:05.4930000
18:17:16.2980000
11:49:08.2010000
10:20:21.7620000
15:56:58.6110000

Addition

When I read the original question I didn't think that it is really necessary to ensure that all generated random numbers are unique. I interpreted the word "different" in the question as a vague opposite to seeing the same number in each row of the result that you see when using a simple SELECT RAND(). I think that in many cases it doesn't matter if there are few colliding random numbers. In many cases it would actually be the correct behavior.

So, my understanding is that when there is a need of a sequence of unique random numbers, it is in a sense equivalent to the following task. We have a set of some values/rows, for example, a set of unique IDs or all 86400 seconds of a day or 2800 rows for a given day. We want to shuffle these values/rows. We want to rearrange these rows in a random order.

To shuffle the given set of rows we simply need to ORDER BY random numbers (these random numbers may have reasonable amount of collisions here). Random numbers could be generated by any method. Something like this:

ROW_NUMBER() OVER ([optional PARTITION BY ...] ORDER BY CRYPT_GEN_RANDOM(4)) 

or literally

SELECT ...
FROM ...
ORDER BY CRYPT_GEN_RANDOM(4)

depending on where and how it is used.

like image 8
Vladimir Baranov Avatar answered Nov 09 '22 04:11

Vladimir Baranov