
Distribution of final digits of random numbers in Python

Tags:

python

random

There are two obvious ways to generate a random digit from 0 to 9 in Python. One could generate a random floating point number between 0 and 1, multiply by 10, and round down. Alternatively, one could use the random.randint method.

    import random

    def random_digit_1():
        return int(10 * random.random())

    def random_digit_2():
        return random.randint(0, 9)

I was curious about what would happen if one generated a random number between 0 and 1, and kept the last digit. I didn't necessarily expect the distribution to be uniform, but I found the result quite surprising.

    from random import random, seed
    from collections import Counter

    seed(0)
    counts = Counter(int(str(random())[-1]) for _ in range(1_000_000))
    print(counts)

Output:

    Counter({1: 84206,
             5: 130245,
             3: 119433,
             6: 129835,
             8: 101488,
             2: 100861,
             9: 84796,
             4: 129088,
             7: 120048})

A histogram is shown below. Note that 0 does not appear, since trailing zeros are truncated. But can anyone explain why the digits 4, 5, and 6 are more common than the rest? I used Python 3.6.10, but the results were similar in Python 3.8.0a4.

[Figure: Distribution of final digits of random floats]

Dave Radcliffe asked Apr 25 '20 01:04




2 Answers

That's not "the last digit" of the number. That's the last digit of the string str gave you when passed the number.

When you call str on a float, Python gives you enough digits that calling float on the string will give you the original float. For this purpose, a trailing 1 or 9 is less likely to be necessary than other digits, because a trailing 1 or 9 means the number is very close to the value you'd get by rounding off that digit. There's a good chance no other floats are closer, and if so, that digit can be discarded without sacrificing float(str(original_float)) behavior.
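A minimal sketch of that round-trip behavior, assuming CPython 3.1 or later, where str and repr of a float give the shortest string that converts back to the same float. The helper name `digits_needed` is made up for illustration; it only probes the behavior and is not how CPython computes the string.

```python
from random import random, seed


def digits_needed(x):
    """Smallest n such that x formatted with n significant digits round-trips.

    Hypothetical helper, for illustration only.
    """
    for n in range(1, 18):
        if float(format(x, f".{n}g")) == x:
            return n
    raise AssertionError("17 significant digits always round-trip a double")


seed(0)
for _ in range(5):
    x = random()
    # repr(x) round-trips; the fixed 17-digit form also does, but may be longer.
    print(repr(x), format(x, ".17g"), digits_needed(x))
```

Most floats from random.random() need 16 or 17 digits, which is why the "last digit" of the string depends on where the rounding happened to stop.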

If str gave you enough digits to exactly represent the argument, the last digit would almost always be 5, except when random.random() returns 0.0, in which case the last digit would be 0. (Floats can only represent dyadic rationals, and the last nonzero decimal digit of a non-integer dyadic rational is always 5.) The outputs would also be extremely long, looking like

    >>> import decimal, random
    >>> print(decimal.Decimal(random.random()))
    0.29711195452007921335990658917580731213092803955078125

which is one of the reasons str doesn't do that.

If str gave you exactly 17 significant digits (enough to distinguish all float values from each other, but sometimes more digits than necessary), then the effect you're seeing would disappear. There would be a nearly uniform distribution of trailing digits (including 0).
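Here is one way to check that claim, as a sketch rather than a definitive test. It formats each float with the `.16e` format (one digit before the point and 16 after, so exactly 17 significant digits with trailing zeros kept) and counts the last mantissa digit. The helper name `last_of_17_digits` and the sample size are my own choices.

```python
from collections import Counter
from random import random, seed


def last_of_17_digits(x):
    """Return the 17th significant digit of x as an int.

    ".16e" always prints 1 digit before the point and 16 after it,
    so trailing zeros are not stripped.
    """
    mantissa = format(x, ".16e").split("e")[0]
    return int(mantissa[-1])


seed(0)
counts = Counter(last_of_17_digits(random()) for _ in range(100_000))
for digit in range(10):
    print(digit, counts[digit])
```

With a fixed digit count, every digit from 0 to 9 shows up with roughly the same frequency, unlike the shortest-repr output in the question.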

(Also, you forgot that str sometimes returns a string in scientific notation, but that's a minor effect, because there's a low probability of getting a float where that would happen out of random.random().)
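To illustrate the scientific-notation point: for a small float, the last character of the string belongs to the exponent, not the mantissa. The helper `last_mantissa_digit` below is a hypothetical name used only for this sketch, and it assumes a finite float (it does not handle inf or nan).

```python
def last_mantissa_digit(x):
    """Last digit of str(x), ignoring any exponent part (hypothetical helper)."""
    mantissa = str(x).split("e")[0]
    return int(mantissa[-1])


small = 1.234e-05
print(str(small))                  # the string contains an exponent part
print(int(str(small)[-1]))         # naive approach: picks up an exponent digit
print(last_mantissa_digit(small))  # digit taken from the mantissa instead
```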

user2357112 supports Monica answered Oct 10 '22 07:10


TL;DR Your example is not actually looking at the last digit. The last digit of a finite binary-represented mantissa converted to base-10 should always be 0 or 5.


Take a look at the comment in cpython/pystrtod.c:

    char * PyOS_double_to_string(double val,
                                 char format_code,
                                 int precision,
                                 int flags,
                                 int *type)
    {
        char format[32];
        Py_ssize_t bufsize;
        char *buf;
        int t, exp;
        int upper = 0;

        /* Validate format_code, and map upper and lower case */
        switch (format_code) {
        // ...
        case 'r':          /* repr format */
            /* Supplied precision is unused, must be 0. */
            if (precision != 0) {
                PyErr_BadInternalCall();
                return NULL;
            }
            /* The repr() precision (17 significant decimal digits) is the
               minimal number that is guaranteed to have enough precision
               so that if the number is read back in the exact same binary
               value is recreated.  This is true for IEEE floating point
               by design, and also happens to work for all other modern
               hardware. */
            precision = 17;
            format_code = 'g';
            break;
        // ...
    }

Wikipedia confirms this:

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2^-53 ≈ 1.11 × 10^-16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.
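The two guarantees in that quote can be spot-checked in Python. This is only a sketch with hand-picked values (the 15-digit string and `0.1 + 0.2` are my own examples), not a proof.

```python
# 15 significant digits: decimal string -> double -> decimal string is lossless.
s = "0.123456789012345"
assert format(float(s), ".15g") == s

# 17 significant digits: double -> decimal string -> double is lossless.
x = 0.1 + 0.2
assert float(format(x, ".17g")) == x

# Fewer than 17 digits may not be enough to recover the same double.
assert float(format(x, ".15g")) != x
print(format(x, ".15g"), format(x, ".17g"))
```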

Thus, when we use str (or repr), we get at most 17 significant digits in base 10, so the exact decimal expansion of the float is cut short. In fact, the exact representation of a value returned by random() can need up to 53 significant digits! You can verify this as follows:

    >>> from collections import Counter
    >>> from random import random
    >>> counts = Counter(
    ...     len(f"{random():.99f}".lstrip("0.").rstrip("0"))
    ...     for _ in range(1000000)
    ... )
    >>> counts
    Counter({53: 449833,
             52: 270000,
             51: 139796,
             50: 70341,
             49: 35030,
             48: 17507,
             47: 8610,
             46: 4405,
             45: 2231,
             44: 1120,
             43: 583,
             42: 272,
             41: 155,
             40: 60,
             39: 25,
             38: 13,
             37: 6,
             36: 5,
             35: 4,
             34: 3,
             32: 1})
    >>> max(counts)
    53

Now using the maximum precision, here's the proper way to find the "last digit":

    >>> counts = Counter(
    ...     int(f"{random():.53f}".lstrip("0.").rstrip("0")[-1])
    ...     for _ in range(1000000)
    ... )
    >>> counts
    Counter({5: 1000000})

Thus, the last digit is always 5. (Or, in very rare cases, 0.) This makes sense since:

    2**0  == 1.0
    2**-1 == 0.5
    2**-2 == 0.25
    2**-3 == 0.125
    2**-4 == 0.0625
    2**-5 == 0.03125
    2**-6 == 0.015625
    2**-7 == 0.0078125
    2**-8 == 0.00390625
    2**-9 == 0.001953125
    ...
    2**-k == 0.[k-1 digits]5

And every mantissa is a sum of some subset of these terms, so its final nonzero digit is the 5 contributed by the smallest term in the sum.
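The same argument can be checked with exact integer arithmetic instead of string formatting. This is a sketch under my own naming (`exact_last_digit` is hypothetical): a float equals m / 2**k with m odd, which is the same as (m * 5**k) / 10**k, so the last decimal digit is the last digit of m * 5**k.

```python
from fractions import Fraction
from random import random, seed


def exact_last_digit(x):
    """Last digit of the exact decimal expansion of a finite float x >= 0."""
    frac = Fraction(x)                      # exact value, in lowest terms
    k = frac.denominator.bit_length() - 1   # denominator == 2**k
    scaled = frac.numerator * 5**k          # x == scaled / 10**k
    return scaled % 10


seed(0)
digits = {exact_last_digit(random()) for _ in range(10_000)}
print(digits)
```

For any non-integer float the numerator is odd and k is at least 1, so the product is an odd multiple of 5 and must end in 5; 0.0 is the one value in the range of random.random() that gives 0.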


NOTE: As pointed out by user2357112, the correct implementations to look at are PyOS_double_to_string and format_float_short, but I'll leave the current one in because it's more pedagogically interesting.

Mateen Ulhaq answered Oct 10 '22 06:10