
Creating your own Tinyurl style uid

I'm writing a small article on humanly readable alternatives to Guids/UIDs, for example those used on TinyURL for the url hashes (which are often printed in magazines, so need to be short).

The simple uid I'm generating is 6 characters long, where each character is either a lowercase letter (a-z) or a digit (0-9).

"According to my calculations captain", that's 6 mutually exclusive events, although calculating the probability of a clash gets a little harder than P(A or B) = P(A) + P(B), as obviously it includes numbers and from the code below, you can see it works out whether to use a number or letter using 50/50.

I'm interested in the clash rate, and in whether the code below is a realistic simulation of the clash rate you'd get from generating a hash. On average I get 40-50 clashes per million; however, bear in mind the uid wouldn't be generated a million times at once, but probably only around 10-1000 times a minute.

What is the probability of a clash each time, and can anyone suggest a better way of doing it?

using System;
using System.Collections.Generic;
using System.Text;

public static class UidClashTest
{
    // Size of the key, 6
    const int KeyLength = 6;

    static readonly Random _random = new Random();

    public static void Main()
    {
        HashSet<string> set = new HashSet<string>();
        int clashes = 0;
        for (int n = 0; n < 1000000; n++)
        {
            StringBuilder builder = new StringBuilder();

            for (int i = 0; i < KeyLength; i++)
            {
                // 50/50 choice between a letter and a digit
                if (_random.NextDouble() > 0.5)
                {
                    // 'a' (97) to 'z' (122); the upper bound of Next is exclusive
                    builder.Append((char)_random.Next(97, 123));
                }
                else
                {
                    // 0 to 9; the upper bound is exclusive, so it has to be 10
                    builder.Append(_random.Next(0, 10).ToString());
                }
            }

            string key = builder.ToString();

            // Add returns false when the key is already in the set
            if (!set.Add(key))
            {
                clashes++;
                Console.WriteLine("clash: (" + n + ") " + key);
            }
        }

        Console.WriteLine("Clashes: " + clashes);
        Console.ReadLine();
    }
}
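
A note on reading the 40-50 figure: the clash count is very sensitive to the exact key space being simulated. A loop that appends 7 characters and draws digits with `Next(0, 9)` (exclusive upper bound, so only 0-8) behaves differently from a 6-character key over a-z and 0-9. The rough Python estimate below (Python rather than C# purely for brevity) multiplies the number of key pairs by the chance that two keys match. The function names are made up for illustration, and the pair-counting approximation assumes clashes are rare.

```python
def match_probability(length, digits, letters=26):
    """Chance that two independently generated keys are identical."""
    # Per position: both pick a digit (0.5 * 0.5) and agree (1/digits),
    # or both pick a letter (0.5 * 0.5) and agree (1/letters).
    per_char = 0.25 / digits + 0.25 / letters
    return per_char ** length


def expected_clashes(n, p):
    """Approximate clash count among n keys: number of pairs times p."""
    return n * (n - 1) / 2 * p


n = 1_000_000
seven_chars_nine_digits = expected_clashes(n, match_probability(7, 9))
six_chars_ten_digits = expected_clashes(n, match_probability(6, 10))
print(round(seven_chars_nine_digits), round(six_chars_ten_digits))
```

Under these assumptions the 7-character, digits 0-8 variant comes out at about 51 clashes per million, which is where the 40-50 figure sits, while a true 6-character a-z/0-9 key comes out at about 860.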

UPDATE: Here's the resulting article from this question

I really asked two questions here so I was cheating. The answer I was after was rcar's, however Sklivvz's is also the answer to the 2nd part (an alternative). Is it possible to make a custom unique id generator in a database, or would it be client side (which would be 2 possible reads first)?

The general idea I was after was using Ids in databases or other stores that can be used by phone or printed material, not a giant 16 byte guid.

UPDATE 2: I put the formula for two mutually exclusive events above instead of 2 independent ones (as getting an 'a' the first time doesn't mean you can't get an 'a' the second time). Should've been P(A and B) = P(A) x P(B)

Chris S asked Oct 10 '08 10:10



4 Answers

Why do you want to use a random function? I always assumed that tinyurl used a base 62 (0-9A-Za-z) representation of a sequential Id. No clashes and the urls are always as short as possible.

You would have a DB table like

Id  URL
 1  http://google.com
 2  ...
... ...
156 ...
... ...

and the corresponding URLs would be:

http://example.com/1
http://example.com/2
...
http://example.com/2W
...
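
A minimal sketch of that mapping in Python, assuming the digit order 0-9, A-Z, a-z given above (the `encode` and `decode` names are hypothetical, and this is not claimed to be TinyURL's actual scheme):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def encode(number, alphabet=ALPHABET):
    """Turn a non-negative sequential id into a short string."""
    if number < 0:
        raise ValueError("id must be non-negative")
    if number == 0:
        return alphabet[0]
    base = len(alphabet)
    chars = []
    while number > 0:
        number, remainder = divmod(number, base)
        chars.append(alphabet[remainder])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(chars))


def decode(text, alphabet=ALPHABET):
    """Turn the short string back into the numeric id."""
    base = len(alphabet)
    number = 0
    for ch in text:
        number = number * base + alphabet.index(ch)
    return number


print(encode(156))   # 2W
print(decode("2W"))  # 156
```

Because the id is sequential, two URLs can never receive the same code, and the lookup on redirect is a plain primary-key read after `decode`.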
Sklivvz answered Oct 30 '22 03:10


Look up the Birthday Paradox, it's the exact problem that you're running into.

The question is: How many people do you need to get together in a room, so that you have a 50% chance of any two people having the same birthdate? The answer may surprise you.
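
For the classic version of the puzzle, a short Python sketch that finds the answer by multiplying out the chance that everyone's birthday is distinct. It assumes 365 equally likely birthdays; `people_needed` is a name made up for this example.

```python
def people_needed(days=365, target=0.5):
    """Smallest group size whose chance of a shared birthday reaches target."""
    p_all_distinct = 1.0
    people = 0
    while 1.0 - p_all_distinct < target:
        # The next person must avoid every birthday already taken.
        p_all_distinct *= (days - people) / days
        people += 1
    return people


print(people_needed())  # 23
```

The same effect applies to generated ids: clashes start appearing once the number of ids is around the square root of the key space, long before the key space is anywhere near full.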

Greg Hewgill answered Oct 30 '22 03:10


Some time ago I did exactly this, and I followed the way Sklivvz mentioned. The whole logic was developed with a SQL server stored procedure and a couple of UDF (user defined functions). The steps were:

  • say that you want to shorten this url: Creating your own Tinyurl style uid
  • Insert the URL in a table
  • Obtain the @@identity value of the last insert (a numeric id)
  • Transform the id into a corresponding alphanumeric value, based on a "domain" of letters and numbers (I actually used this set: "0123456789abcdefghijklmnopqrstuvwxyz")
  • Return that value back, something like 'cc0'

The conversion was implemented through a couple of very short UDFs.

Two conversions called one after the other would return "sequential" values like these:

select dbo.FX_CONV (123456) -- returns "1f5n"

select dbo.FX_CONV (123457) -- returns "1f5o"
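
The steps above can be sketched in Python, with an in-memory SQLite table standing in for SQL Server and `lastrowid` standing in for `@@identity`. The table name, `to_code` and `shorten` are hypothetical, and `to_code` is a plain base-36 conversion over the same character set. It is not the original `FX_CONV`: plain base 36 turns 123456 into "2n9c" rather than "1f5n", so the original UDFs evidently mapped ids differently.

```python
import sqlite3

DOMAIN = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_code(number):
    """Plain base-36 conversion of a positive id (an assumed mapping)."""
    chars = []
    while number > 0:
        number, remainder = divmod(number, len(DOMAIN))
        chars.append(DOMAIN[remainder])
    return "".join(reversed(chars)) or DOMAIN[0]


def shorten(conn, url):
    """Insert the URL, read back its numeric id, return the short code."""
    cur = conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    return to_code(cur.lastrowid)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)"
)

print(shorten(conn, "http://google.com"))  # 1
print(to_code(123456), to_code(123457))    # 2n9c 2n9d
```

Consecutive ids still give consecutive codes, which is the property the two calls above illustrate.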

If you are interested I can share the UDF's code.

ila answered Oct 30 '22 05:10


The probability of a collision against one specific ID is:

p = ( 0.5 * ( (0.5*1/10) + (0.5*1/26) ) )^6

which is around 1.7×10^-9.

The probability that a new ID collides with at least one of n existing IDs is 1-(1-p)^n, so you'll have roughly a 0.17% chance of a collision for each new insertion after 1 million IDs have been inserted, around 1.7% after 10 million IDs, and around 16% after 100 million.
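
These figures can be checked with a few lines of Python. The sketch uses the complement form 1 - (1 - p)^n, which is the form that reproduces the percentages quoted here, and treats the existing IDs as independent draws; `collision_chance` is a name made up for the example.

```python
import math

# Chance that a new ID matches one specific existing ID.
p = (0.5 * (0.5 / 10 + 0.5 / 26)) ** 6


def collision_chance(existing_ids, p=p):
    """Chance that a new ID matches at least one of the existing IDs."""
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(existing_ids * math.log1p(-p))


for n in (1_000_000, 10_000_000, 100_000_000):
    print(f"{n:>11,} existing IDs: {collision_chance(n):.2%}")
```

For small p the result is close to n × p, which is why the chance grows roughly tenfold with each tenfold increase in stored IDs until it starts to saturate.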

1000 IDs/minute works out to about 43 million/month, so as Sklivvz pointed out, using some incrementing ID is probably going to be a better way to go in this case.

EDIT:

To explain the math, he's essentially flipping a coin and then picking a number or letter, 6 times over. For a single position to match, the new ID has to land on the same side of the coin flip as the existing one (probability 0.5), and then pick the same character: half of the time that is a 1/10 chance (a digit) and half of the time a 1/26 chance (a letter). That happens 6 times independently, so you multiply those probabilities together.

Randy answered Oct 30 '22 05:10