Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient method to find collision free random numbers

I have a users table, the user ID is public. But I want to obfuscate the number of registered user and trends of the project, so I don't want to have public incrementing IDs.

When a new user is created I want to find a random integer number that is greater than a certain number and that is not yet in the database.

Naive code:

<?php
    $found = false;
    while(!$found) {
      $uid = rand(1000000000,4294967295) // find random number betwen minimum and maximum
      $dbh->beginTransaction();
      // check if user id is in use, and if not insert it
      if($dbh->query("SELECT * FROM users WHERE uid = $uid")) {
        $dbh->exec("INSERT INTO users (uid) VALUES ($uid)");
        $found = true;
      }
      $dbh->commit();
    }
    // we just got our new uid ...
?>

This will work it however may become inefficient. True that there is a big range and the probability of hitting an unused uid is high. But what if I want to use a smaller range, because I don't want to have so long userids?

Example of my concerns:

  • 60% of all user ids are in use
  • the chance of hitting an unused uid are 0.4
  • the first attempt has 0.4% success rate
  • if 1st not successful the second attempt has 0.6*0.4 probability
  • so with a maximum of two tries i have 0.4 + 0.6*0.4 proability (is that right??)

So one method to optimize is that came to my mind is the following:

  • find a random number, check if its free, if not, increment it by 1 and try again and so on
  • if the maximum number is hit, continue with the minimum number

That should give me a number with a maximum runtime of O(range)

That sounds pretty bad but I think it is not, because I submit random numbers to the database and that they are all at the beginnig is very unlikely. So how good/bad is it really?

I think this would work just fine but I want it BETTER

So what about this?

  • find a random number
  • query the database for how many numbers are occupied in the range whole range, starting from that number (this first step is trivial...)
  • if there are numbers occupied in that range, divide the range by half and try again. starting with the initial number
  • if there are numbers occupied divide the range by half and try again. starting with the initial number

If I am thinking correctly this will give ma a number with a maximum of O(log(range)) time.

That is pretty satisfying because log() is pretty good. However I think this method will often be as bad as possible. Because with our random numbers we will probably always hit numbers in the large intervals.

So at the beginning our pure random method is probably better.

So what about having a limit like this

  • select current number of used numbers
  • is it greater than X, logarithmic range approach
  • if it is not, use pure random method

What would X be and why?

So final question:

This is pretty easy and pretty complicated at the same time.

I think this is a standard problem because lots and lots of system use random ids (support tickets etc), so I cannot imagine I am the first one to stumble across this.

How would you solve this? Any input is appriciated!

Is there maby an existing class / procedure for this I can use?

Or maby some database functions that I can use?

I would like to do it in PHP/Mysql

IMPORTANT EDIT:

I just thought about the range/logarithmic solution. It seems to be complete bullshit sorry for my wording because:

  • what if i hit an occupied number at start?

Then I am dividing my range so long if it is only 1. And even then the number is occoupied.

So its completely the same as the pure random method from start, only worse....

I am a bit embarassed I made this up but I will leave it in because I think its a good example of overcomplicated thinknig!

like image 785
The Surrican Avatar asked Jul 24 '11 17:07

The Surrican


People also ask

What methods can be used to get a random number between 0 and 1?

We can use srand () and rand () function to generate random numbers between 0 and 1. Note that we have to cast the output of rand () function to the decimal value either float or double.


1 Answers

If p is the proportion of ids in use, your "naive" solution will, on average, require 1/(1-p) attempts to find an unused id. (See Exponential distribution). In the case of 60% occupancy, that is a mere 1/0.4 = 2.5 queries ...

Your "improved" solution requires about log(n) database calls, where n is the number of ids in use. That is quite a bit more than the "naive" solution. Also, your improved solution is incomplete (for instance, it does not handle the case where all number in a subrange are taken, and does not elaborate with subrange you recurse into) and is more complex to implement to boot.

Finally, note that your implementation will only be thread safe if the database provides very strict transaction isolation, which scales poorly, and might not be the default behaviour of your database system. If that turns out to be a problem, you could speculatively insert with a random id, and retry in the event of a constraint violation.

like image 53
meriton Avatar answered Oct 05 '22 06:10

meriton