Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

java random string generation and birthday paradox

i need to write a random string generation class which generates 7char strings from a 31-char charset of numbers and some alphabets (10+26-5 , 5 vowels omitted). simple maths gives a set of 31^7 possible combinations ~ 27.5 billion. i have questions regarding the bday paradox, i ran some tests and the number of duplicates increase exponentially. can i do something to avoid this ?

At 1 million, duplicates encountered till now = 19
At 2 million, duplicates encountered till now = 69
At 3 million, duplicates encountered till now = 157
At 4 million, duplicates encountered till now = 280
At 5 million, duplicates encountered till now = 470
At 6 million, duplicates encountered till now = 662
At 7 million, duplicates encountered till now = 896
At 8 million, duplicates encountered till now = 1185
At 9 million, duplicates encountered till now = 1500
At 10 million, duplicates encountered till now = 1823
At 11 million, duplicates encountered till now = 2204
At 12 million, duplicates encountered till now = 2584
At 13 million, duplicates encountered till now = 3020
At 14 million, duplicates encountered till now = 3527
At 15 million, duplicates encountered till now = 4110
At 16 million, duplicates encountered till now = 4683
At 17 million, duplicates encountered till now = 5284
At 18 million, duplicates encountered till now = 5919
At 19 million, duplicates encountered till now = 6611
At 20 million, duplicates encountered till now = 7343
At 21 million, duplicates encountered till now = 8095
At 22 million, duplicates encountered till now = 8858
At 23 million, duplicates encountered till now = 9707
At 24 million, duplicates encountered till now = 10547
At 25 million, duplicates encountered till now = 11452
At 26 million, duplicates encountered till now = 12399
At 27 million, duplicates encountered till now = 13356
At 28 million, duplicates encountered till now = 14393
At 29 million, duplicates encountered till now = 15369
At 30 million, duplicates encountered till now = 16436

Below is the test class:

import java.util.Set;

import org.apache.commons.lang3.RandomStringUtils;

import com.google.common.collect.Sets;

public class RandomUnivmylocaL {

    public static void main(String[] argv) {

        final int million = 1_000_000;

        final int iterations = 30;
        // 31 chars
        final char[] charArr = new char[] { '1', '2', '3', '4', '5', '6', '7', 
                '8', '9', '0', 'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L',
                'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z' };
        // System.out.println(charArr.length);

        final Set<String> set = Sets.newHashSetWithExpectedSize(
                iterations * million);

        for (int i = 0; i < iterations; i++) {
            for (int j = 0; j < million; j++) {
                final String univCode = RandomStringUtils.random(7, charArr);
                set.add(univCode);
            }
            System.out.println("At " + (i + 1) + " million, " +
                    "duplicates encountered till now = " + 
                    (((i + 1) * million) - set.size()));
        }
        System.out.println("done");
    }
}
like image 738
Ashish Thukral Avatar asked Jul 08 '15 03:07

Ashish Thukral


1 Answers

This is the birthday paradox.

sqrt(27.5bn) = 165831, birthday paradox formula for large M is 1.177*sqrt(M), so when you have generated about 200,000 there would be a 50/50 chance of a problem, by the time you get to a million you would have about 18 problems, etc..

The birthday problem - how many until there is a likelyhood that you will get a collision - about 200,000. http://www.wolframalpha.com/input/?i=n%3D200000,+m%3D(31%5E7),+n+-+m(1+-+((m-1)%2Fm)%5En)

Set n = 23.0 and m = 365 to see the 23 people in a room equivalent. http://www.wolframalpha.com/input/?i=n%3D23.0,+m%3D365,+n+-+m(1+-+((m-1)%2Fm)%5En)

You can see how close your simulation was to the expected answer for large numbers.

http://www.wolframalpha.com/input/?i=n%3D30e6,+m%3D(31%5E7),+n+-+m(1+-+((m-1)%2Fm)%5En)

The Quora article is good. http://www.quora.com/How-can-I-calculate-the-expected-number-of-cache-hits.

So you need to up the number of allowed characters or use a longer string. OR - to use 7 digits, simply increment the counter. OR use random ones, and check if you have used it before, and reset when you get tired of looking for new numbers.

There are also pseudo random generators that could cover the space without having re-hits. 7 characters is just not going to make your solution secure.

like image 73
Tom Andersen Avatar answered Oct 03 '22 01:10

Tom Andersen