
Random integers in C, how bad is rand()%N compared to integer arithmetic? What are its flaws?

Tags: c, random, distribution

EDIT: My question is: rand()%N is considered very bad, whereas the use of integer arithmetic is considered superior, but I cannot see the difference between the two.

People always mention:

low bits are not random in rand()%N,
rand()%N is very predictable,
you can use it for games but not for cryptography

Can someone explain if any of these points are the case here and how to see that?

If the lower bits really are non-random, the permutation entropy (PE) of the two cases that I show should differ, but it does not.

I guess many people, like me, always avoid using rand() or rand()%N because we've always been taught that it is pretty bad. I was curious to see how "wrong" the random integers generated with C's rand()%N actually are. This is also a follow-up to Ryan Reich's answer in How to generate a random integer number from within a range.

The explanation there sounds very convincing, to be honest; nevertheless, I thought I'd give it a try. So I compared the distributions in a VERY naive way: I ran both random generators for different numbers of samples and domains. I didn't see the point of computing a density instead of histograms, so I just computed histograms and, just by looking, I would say they both look equally uniform. Regarding the other point that was raised, about the actual randomness (as opposed to the uniformity of the distribution), I computed, again naively, the permutation entropy for these runs. It is the same for both sample sets, which tells us that there is no difference between the two regarding the ordering of the occurrences.

So, for many purposes, it seems to me that rand()%N would be just fine. How can we see its flaws?

Here I show a very simple, inefficient and not very elegant (but I think correct) way of computing these samples and obtaining the histograms together with the permutation entropies. I show plots for domains (0,i) with i in {5,10,25,50,100} for different numbers of samples:

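[Figure: 5 values, 5k samples] https://i.stack.imgur.com/s26Tw.png
[Figure: 10 values, 10k samples] https://i.stack.imgur.com/REIxa.png
[Figure: 25 values, 250k samples] https://i.stack.imgur.com/Yl0VR.png
[Figure: 100 values, 1M samples] https://i.stack.imgur.com/KTLTg.png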

There's not much to see in the code, I guess, so I will leave both the C and the MATLAB code here for replication purposes.

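#include <stdlib.h>
#include <stdio.h>
#include <time.h>

/* Usage: randomTest <mode> <max> <samples>
   mode 1: rand() % (max + 1); any other mode: rejection + integer division. */
int main(int argc, char *argv[]){
        if(argc < 4){
                fprintf(stderr,"usage: %s <mode> <max> <samples>\n",argv[0]);
                return 1;
        }
        unsigned long max = strtoul(argv[2],NULL,10);
        int samples=atoi(argv[3]);
        if(max > RAND_MAX){
                /* bin_size below would be 0 and x/bin_size would divide by zero */
                fprintf(stderr,"max must not exceed RAND_MAX\n");
                return 1;
        }
        srand((unsigned)time(NULL));
        if(atoi(argv[1])==1){
                for(int i=0;i<samples;++i)
                        /* the result has type unsigned long, so print it with %lu */
                        printf("%lu\n",rand()%(max+1));

        }else{
                /* these depend only on max, so compute them once */
                unsigned long
                num_bins = max + 1,
                num_rand = (unsigned long) RAND_MAX + 1,
                bin_size = num_rand / num_bins,
                defect   = num_rand % num_bins;

                for(int i=0;i<samples;++i){
                        long x;
                        /* reject the top `defect` values so every bin covers bin_size values */
                        do {
                                x = rand();
                        }
                        while (num_rand - defect <= (unsigned long)x);
                        printf("%lu\n",x/bin_size);
                }
        }
        return 0;
}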

And here is the MATLAB code to plot this and compute the PEs (the recursive permutation helper, myperms in the code, is taken from: https://www.mathworks.com/matlabcentral/answers/308255-how-to-generate-all-possible-permutations-without-using-the-function-perms-randperm):

system('gcc randomTest.c -o randomTest.exe;');
max = 100;
samples = max*10000;
trials = 200;
system(['./randomTest.exe 1 ' num2str(max) ' ' num2str(samples) ' > file1'])
system(['./randomTest.exe 2 ' num2str(max) ' ' num2str(samples) ' > file2'])
a1=load('file1');
a2=load('file2');
uni = figure(1);
title(['Samples: ' num2str(samples)])
subplot(1,3,1)
h1 = histogram(a1,max+1);
title('rand%(max+1)')
subplot(1,3,2)
h2 = histogram(a2,max+1);
title('Integer arithmetic')
as=[a1,a2];
ns=3:8;
H = nan(ns(end),size(as,2)); % rows are indexed by n below, so allocate up to the largest n
for op=1:size(as,2)
    x = as(:,op);
    for n=ns
        sequenceOcurrence = zeros(1,factorial(n));
        sequences = myperms(1:n);
        sequencesArrayIdx = sum(sequences.*10.^(size(sequences,2)-1:-1:0),2);
        for i=1:numel(x)-n+1 % visit all length(x)-n+1 windows counted in chunks
            [~,sequenceOrder] = sort(x(i:i+n-1));
            out = sequenceOrder'*10.^(numel(sequenceOrder)-1:-1:0).';
            sequenceOcurrence(sequencesArrayIdx == out) = sequenceOcurrence(sequencesArrayIdx == out) + 1;
        end
        chunks = length(x) - n + 1;
        ps = sequenceOcurrence/chunks;
        hh = -sum(ps(logical(ps)).*log2(ps(logical(ps)))); % Shannon entropy carries a minus sign
        H(n,op) = hh/log2(factorial(n));
    end
end
subplot(1,3,3)
plot(ns,H(ns,:),'--*','linewidth',2)
ylabel('PE')
xlabel('Sequence length')
filename = ['all_' num2str(max) '_' num2str(samples) ];
export_fig(filename)

741

asked Apr 17 '18 14:04

myradio

1 Answer

Because of the way modulo arithmetic works, if N is significant compared to RAND_MAX, doing %N makes some values considerably more likely than others. Imagine RAND_MAX is 11, so rand() returns one of 12 values (0 to 11), and N is 9. If the generator itself is uniform, the outputs 9, 10 and 11 wrap around to 0, 1 and 2, so the chance of getting one of 0, 1 or 2 is 0.5, and the chance of getting one of 3, 4, 5, 6, 7, 8 is also 0.5. The result is that you are twice as likely to get a 0 as a 4.

If N divides RAND_MAX + 1 exactly, this distribution problem doesn't happen, and if N is very small compared to RAND_MAX the issue becomes less noticeable. RAND_MAX may not be a particularly large value (it can be as small as 2^15 - 1), making this problem worse than you might expect. The alternative of doing (rand() * n) / (RAND_MAX + 1) also doesn't give an even distribution; however, the more likely values are then every mth value (for some m) rather than all sitting at the low end of the distribution.

With %N, if N is 75% of RAND_MAX, then the values in the bottom third of your distribution are twice as likely as the values in the top two thirds (as the bottom third is where the extra values map to).

The quality of rand() depends on the implementation on your system. I believe that some systems have had very poor implementations; OS X's man pages declare rand obsolete. The Debian man page says the following:

The versions of rand() and srand() in the Linux C Library use the same random number generator as random(3) and srandom(3), so the lower-order bits should be as random as the higher-order bits. However, on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher-order bits. Do not use this function in applications intended to be portable when good randomness is needed. (Use random(3) instead.)

173

answered Sep 21 '22 01:09

James Snook
