Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Estimating the digits occurrence probability inside a GUID

Recently I decided to investigate the degree of randomness of a globally unique identifier generated with the Guid.NewGuid method (which is also the scope of this question). I documented myself about pseudorandom numbers, pseudorandomness and I was dazzled to find out that there are even random numbers generated by radioactive decay. Anyway, I'll let you discover by yourselves more details about such interesting lectures.

To continue to my question, another important thing to know about GUID's is:

V1 GUIDs which contain a MAC address and time can be identified by the digit "1" in the first position of the third group of digits, for example {2F1E4FC0-81FD-11DA-9156-00036A0F876A}.

V4 GUIDs use the later algorithm, which is a pseudo-random number. These have a "4" in the same position, for example {38A52BE4-9352-453E-AF97-5C3B448652F0}.

To put it in a sentence, a Guid will always have the digit 4 (or 1, but out of our scope) as one of its components.

For my GUID randomness tests I decided to count the number of digits inside some increasingly large collection of GUIDs and compare it with the statistical probability of digit occurrence, expectedOccurrence. Or at least I hope I did (please excuse any statistical formula mistakes, I only tried my best guesses to calculate the values). I used a small C# console application which is listed below.

class Program
{
    static char[] digitsChar = "0123456789".ToCharArray();
    static decimal expectedOccurrence = (10M * 100 / 16) * 31 / 32 + (100M / 32);
    static void Main(string[] args)
    {
        for (int i = 1; i <= 10; i++)
        {
            CalculateOccurrence(i);
        }
    }

    private static void CalculateOccurrence(int counter)
    {
        decimal sum = 0;
        var sBuilder = new StringBuilder();
        int localCounter = counter * 20000;
        for (int i = 0; i < localCounter; i++)
        {
            sBuilder.Append(Guid.NewGuid());
        }

        sum = (sBuilder.ToString()).ToCharArray()
                  .Count(j => digitsChar.Contains(j));

        decimal actualLocalOccurrence = sum * 100 / (localCounter * 32);

        Console.WriteLine(String.Format("{0}\t{1}",
            expectedOccurrence,
            Math.Round(actualLocalOccurrence,3)
            ));
    }
}

The output for the above program is:

63.671875       63.273
63.671875       63.300
63.671875       63.331
63.671875       63.242
63.671875       63.292
63.671875       63.269
63.671875       63.292
63.671875       63.266
63.671875       63.254
63.671875       63.279

So, even if the theoretical occurrence is expected to be 63.671875%, the actual values are somewhere around ~63.2%.

How can this difference be explained? Is there any error in my formulas? Is there any other "obscure" rule in the Guid algorithm?

like image 262
Alex Filipovici Avatar asked Jan 30 '13 02:01

Alex Filipovici


2 Answers

In the version 4 GUID, the first character in the third group is 4. The first character in the fourth group is one of 8, 9, a, or b. The specification does not say anything about how that first character in the fourth group is generated. That could be throwing off your results.

If you want to investigate further, you need to keep track of how often each hexadecimal digit appears in each position. I suspect that will reveal the difference, and help you to determine whether your theoretical estimate is off, or the pseudo-random algorithm is slightly biased.

like image 156
Jim Mischel Avatar answered Oct 06 '22 05:10

Jim Mischel


Jim got it (I just found this question whose answer that gave the same incite into v4 guid generation).

So by altering the expected equation with this new knowledge, you get: ((10/16)*30+1+0.5)/32 or (10M * 100 / 16) * 30 / 32 + (150M / 32), which is about 63.28%, pretty close to the experimental data you were getting.

like image 37
Cemafor Avatar answered Oct 06 '22 04:10

Cemafor