Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Random integers in C, how bad is rand()%N compared to integer arithmetic? What are its flaws?

EDIT: My question is: rand()%N is considered very bad, whereas the use of integer arithmetic is considered superior, but I cannot see the difference between the two.

People always mention:

  • low bits are not random in rand()%N,

  • rand()%N is very predictable,

  • you can use it for games but not for cryptography

Can someone explain if any of these points are the case here and how to see that?

The idea of the non-randomness of the lower bits is something that should make the PE of the two cases that I show differ, but it's not the case.

I guess many like me would always avoid using rand(), or rand()%N because we've been always taught that it is pretty bad. I was curious to see how "wrong" random integers generated with c rand()%N effectively are. This is also a follow up to Ryan Reich's answer in How to generate a random integer number from within a range.

The explanation there sounds very convincing, to be honest; nevertheless, I thought I’d give it a try. So, I compare the distributions in a VERY naive way. I run both random generators for different numbers of samples and domains. I didn't see the point of computing a density instead of histograms, so I just computed histograms and, just by looking, I would say they both look just as uniform. Regarding the other point that was raised, about the actual randomness (despite being uniformly distributed). I — again naively —compute the permutation entropy for these runs, which are the same for both sample sets, which tell us that there's no difference between both regarding the ordering of the occurrence.

So, for many purposes, it seems to me that rand()%N would be just fine, how can we see their flaws?

Here I show you a very simple, inefficient and not very elegant (but I think correct) way of computing these samples and get the histograms together with the permutation entropies. I show plots for domains (0,i) with i in {5,10,25,50,100} for different number of samples:

5 values, 5k samples

10 values 10k samples

25 values, 250k samples

100values,  1M samples

There's not much to see in the code I guess, so I will leave both the C and the matlab code for replication purposes.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char *argv[]){
        unsigned long max = atoi(argv[2]);
        int samples=atoi(argv[3]);
        srand(time(NULL));
        if(atoi(argv[1])==1){
                for(int i=0;i<samples;++i)
                        printf("%ld\n",rand()%(max+1));

        }else{
                for(int i=0;i<samples;++i){
                        unsigned long
                        num_bins = (unsigned long) max + 1,
                        num_rand = (unsigned long) RAND_MAX + 1,
                        bin_size = num_rand / num_bins,
                        defect   = num_rand % num_bins;

                        long x;
                        do {
                                x = rand();
                        }
                        while (num_rand - defect <= (unsigned long)x);
                        printf("%ld\n",x/bin_size);
                }
        }
        return 0;
}

And here is the Matlab code to plot this and compute the PEs (the recursion for the permutations I took it from: https://www.mathworks.com/matlabcentral/answers/308255-how-to-generate-all-possible-permutations-without-using-the-function-perms-randperm):

system('gcc randomTest.c -o randomTest.exe;');
max = 100;
samples = max*10000;
trials = 200;
system(['./randomTest.exe 1 ' num2str(max) ' ' num2str(samples) ' > file1'])
system(['./randomTest.exe 2 ' num2str(max) ' ' num2str(samples) ' > file2'])
a1=load('file1');
a2=load('file2');
uni = figure(1);
title(['Samples: ' num2str(samples)])
subplot(1,3,1)
h1 = histogram(a1,max+1);
title('rand%(max+1)')
subplot(1,3,2)
h2 = histogram(a2,max+1);
title('Integer arithmetic')
as=[a1,a2];
ns=3:8;
H = nan(numel(ns),size(as,2));
for op=1:size(as,2)
    x = as(:,op);
    for n=ns
        sequenceOcurrence = zeros(1,factorial(n));
        sequences = myperms(1:n);
        sequencesArrayIdx = sum(sequences.*10.^(size(sequences,2)-1:-1:0),2);
        for i=1:numel(x)-n
            [~,sequenceOrder] = sort(x(i:i+n-1));
            out = sequenceOrder'*10.^(numel(sequenceOrder)-1:-1:0).';
            sequenceOcurrence(sequencesArrayIdx == out) = sequenceOcurrence(sequencesArrayIdx == out) + 1;
        end
        chunks = length(x) - n + 1;
        ps = sequenceOcurrence/chunks;
        hh = sum(ps(logical(ps)).*log2(ps(logical(ps))));
        H(n,op) = hh/log2(factorial(n));
    end
end
subplot(1,3,3)
plot(ns,H(ns,:),'--*','linewidth',2)
ylabel('PE')
xlabel('Sequence length')
filename = ['all_' num2str(max) '_' num2str(samples) ];
export_fig(filename)
like image 741
myradio Avatar asked Apr 17 '18 14:04

myradio


People also ask

Why rand() is bad?

The most visible problem of it is that it lacks a distribution engine: rand gives you a number in interval [0 RAND_MAX] . It is uniform in this interval, which means that each number in this interval has the same probability to appear. But most often you need a random number in a specific interval.

How does rand() in C work?

DESCRIPTION The rand() function returns a pseudo-random integer in the range 0 to RAND_MAX inclusive (i.e., the mathematical range [0, RAND_MAX]). The srand() function sets its argument as the seed for a new sequence of pseudo-random integers to be returned by rand().

What does rand() return in C?

Description. The C library function int rand(void) returns a pseudo-random number in the range of 0 to RAND_MAX. RAND_MAX is a constant whose default value may vary between implementations but it is granted to be at least 32767.

How does C random number generator work?

rand() rand() function is an inbuilt function in C++ STL, which is defined in header file <cstdlib>. rand() is used to generate a series of random numbers. The random number is generated by using an algorithm that gives a series of non-related numbers whenever this function is called.


1 Answers

Due to the way modulo arithmetic works if N is significant compared to RAND_MAX doing %N will make it so you're considerably more likely to get some values than others. Imagine RAND_MAX is 12, and N is 9. If the distribution is good then the chances of getting one of 0, 1, or 2 is 0.5, and the chances of getting one of 3, 4, 5, 6, 7, 8 is 0.5. The result being that you're twice as likely to get a 0 instead of a 4. If N is an exact divider of RAND_MAX this distribution problem doesn't happen, and if N is very small compared to RAND_MAX the issue becomes less noticeable. RAND_MAX may not be a particularly large value (maybe 2^15 - 1), making this problem worse than you may expect. The alternative of doing (rand() * n) / (RAND_MAX + 1) also doesn't give an even distribution, however, it will be every mth value (for some m) that will be more likely to occur rather than the more likely values all being at the low end of the distribution.

If N is 75% of RAND_MAX then the values in the bottom third of your distribution are twice as likely as the values in the top two thirds (as this is where the extra values map to)

The quality of rand() will depend on the implementation of the system that you're on. I believe that some systems have had very poor implementation, OS Xs man pages declare rand obsolete. The Debian man page says the following:

The versions of rand() and srand() in the Linux C Library use the same random number generator as random(3) and srandom(3), so the lower-order bits should be as random as the higher-order bits. However, on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher- order bits. Do not use this function in applications intended to be portable when good randomness is needed. (Use random(3) instead.)

like image 173
James Snook Avatar answered Sep 21 '22 01:09

James Snook