Recently I decided to investigate the degree of randomness of a globally unique identifier generated with the Guid.NewGuid
method (which is also the scope of this question). I documented myself about pseudorandom numbers, pseudorandomness and I was dazzled to find out that there are even random numbers generated by radioactive decay. Anyway, I'll let you discover by yourselves more details about such interesting lectures.
To continue to my question, another important thing to know about GUID's is:
V1 GUIDs which contain a MAC address and time can be identified by the digit "1" in the first position of the third group of digits, for example {2F1E4FC0-81FD-11DA-9156-00036A0F876A}.
V4 GUIDs use the later algorithm, which is a pseudo-random number. These have a "4" in the same position, for example {38A52BE4-9352-453E-AF97-5C3B448652F0}.
To put it in a sentence, a Guid
will always have the digit 4 (or 1, but out of our scope) as one of its components.
For my GUID randomness tests I decided to count the number of digits inside some increasingly large collection of GUIDs and compare it with the statistical probability of digit occurrence, expectedOccurrence
. Or at least I hope I did (please excuse any statistical formula mistakes, I only tried my best guesses to calculate the values). I used a small C#
console application which is listed below.
class Program
{
static char[] digitsChar = "0123456789".ToCharArray();
static decimal expectedOccurrence = (10M * 100 / 16) * 31 / 32 + (100M / 32);
static void Main(string[] args)
{
for (int i = 1; i <= 10; i++)
{
CalculateOccurrence(i);
}
}
private static void CalculateOccurrence(int counter)
{
decimal sum = 0;
var sBuilder = new StringBuilder();
int localCounter = counter * 20000;
for (int i = 0; i < localCounter; i++)
{
sBuilder.Append(Guid.NewGuid());
}
sum = (sBuilder.ToString()).ToCharArray()
.Count(j => digitsChar.Contains(j));
decimal actualLocalOccurrence = sum * 100 / (localCounter * 32);
Console.WriteLine(String.Format("{0}\t{1}",
expectedOccurrence,
Math.Round(actualLocalOccurrence,3)
));
}
}
The output for the above program is:
63.671875 63.273
63.671875 63.300
63.671875 63.331
63.671875 63.242
63.671875 63.292
63.671875 63.269
63.671875 63.292
63.671875 63.266
63.671875 63.254
63.671875 63.279
So, even if the theoretical occurrence is expected to be 63.671875%
, the actual values are somewhere around ~63.2%
.
How can this difference be explained? Is there any error in my formulas? Is there any other "obscure" rule in the Guid
algorithm?
In the version 4 GUID, the first character in the third group is 4
. The first character in the fourth group is one of 8
, 9
, a
, or b
. The specification does not say anything about how that first character in the fourth group is generated. That could be throwing off your results.
If you want to investigate further, you need to keep track of how often each hexadecimal digit appears in each position. I suspect that will reveal the difference, and help you to determine whether your theoretical estimate is off, or the pseudo-random algorithm is slightly biased.
Jim got it (I just found this question whose answer that gave the same incite into v4 guid generation).
So by altering the expected equation with this new knowledge, you get: ((10/16)*30+1+0.5)/32
or (10M * 100 / 16) * 30 / 32 + (150M / 32)
, which is about 63.28%, pretty close to the experimental data you were getting.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With