I would like to generate a random string (or a series of random strings, repetitions allowed) of length between 1 and n characters from some (finite) alphabet. Each string should be equally likely (in other words, the strings should be uniformly distributed).
The uniformity requirement means that an algorithm like this doesn't work:
alphabet = "abcdefghijklmnopqrstuvwxyz"
len = rand(1, n)
s = ""
for(i = 0; i < len; ++i)
    s = s + alphabet[rand(0, 25)]
(pseudo code, rand(a, b) returns a integer between a and b, inclusively, each integer equally likely)
This algorithm generates strings with uniformly distributed lengths, but the actual distribution should be weighted toward longer strings (there are 26 times as many strings with length 2 as there are with length 1, and so on.) How can I achieve this?
What you need to do is generate your length and then your string as two distinct steps.  You will need to first chose the length using a weighted approach.  You can calculate the number of strings of a given length l for an alphabet of k symbols as k^l.  Sum those up and then you have the total number of strings of any length, your first step is to generate a random number between 1 and that value and then bin it accordingly.  Modulo off by one errors you would break at 26, 26^2, 26^3, 26^4 and so on.  The logarithm based on the number of symbols would be useful for this task.
Once you have you length then you can generate the string as you have above.
Okay, there are 26 possibilities for a 1-character string, 262 for a 2-character string, and so on up to 2626 possibilities for a 26-character string.
That means there are 26 times as many possibilities for an (N)-character string than there are for an (N-1)-character string. You can use that fact to select your length:
def getlen(maxlen):
    sz = maxlen
    while sz != 1:
        if rnd(27) != 1:
            return sz
        sz--;
    return 1
I use 27 in the above code since the total sample space for selecting strings from "ab" is the 26 1-character possibilities and the 262 2-character possibilities. In other words, the ratio is 1:26 so 1-character has a probability of 1/27 (rather than 1/26 as I first answered).
This solution isn't perfect since you're calling rnd multiple times and it would be better to call it once with an possible range of 26N+26N-1+261 and select the length based on where your returned number falls within there but it may be difficult to find a random number generator that'll work on numbers that large (10 characters gives you a possible range of 2610+...+261 which, unless I've done the math wrong, is 146,813,779,479,510).
If you can limit the maximum size so that your rnd function will work in the range, something like this should be workable:
def getlen(chars,maxlen):
    assert maxlen >= 1
    range = chars
    sampspace = 0
    for i in 1 .. maxlen:
        sampspace = sampspace + range
        range = range * chars
    range = range / chars
    val = rnd(sampspace)
    sz = maxlen
    while val < sampspace - range:
        sampspace = sampspace - range
        range = range / chars
        sz = sz - 1
    return sz
Once you have the length, I would then use your current algorithm to choose the actual characters to populate the string.
Explaining it further:
Let's say our alphabet only consists of "ab". The possible sets up to length 3 are [ab] (2), [ab][ab] (4) and [ab][ab][ab] (8). So there is a 8/14 chance of getting a length of 3, 4/14 of length 2 and 2/14 of length 1.
The 14 is the magic figure: it's the sum of all 2n for n = 1 to the maximum length. So, testing that pseudo-code above with chars = 2 and maxlen = 3:
    assert maxlen >= 1 [okay]
    range = chars [2]
    sampspace = 0
    for i in 1 .. 3:
        i = 1:
            sampspace = sampspace + range [0 + 2 = 2]
            range = range * chars [2 * 2 = 4]
        i = 2:
            sampspace = sampspace + range [2 + 4 = 6]
            range = range * chars [4 * 2 = 8]
        i = 3:
            sampspace = sampspace + range [6 + 8 = 14]
            range = range * chars [8 * 2 = 16]
    range = range / chars [16 / 2 = 8]
    val = rnd(sampspace) [number from 0 to 13 inclusive]
    sz = maxlen [3]
    while val < sampspace - range: [see below]
        sampspace = sampspace - range
        range = range / chars
        sz = sz - 1
    return sz
So, from that code, the first iteration of the final loop will exit with sz = 3 if val is greater than or equal to sampspace - range [14 - 8 = 6]. In other words, for the values 6 through 13 inclusive, 8 of the 14 possibilities.
Otherwise, sampspace becomes sampspace - range [14 - 8 = 6] and range becomes range / chars [8 / 2 = 4].
Then the second iteration of the final loop will exit with sz = 2 if val is greater than or equal to sampspace - range [6 - 4 = 2]. In other words, for the values 2 through 5 inclusive, 4 of the 14 possibilities.
Otherwise, sampspace becomes sampspace - range [6 - 4 = 2] and range becomes range / chars [4 / 2 = 2].
Then the third iteration of the final loop will exit with sz = 1 if val is greater than or equal to sampspace - range [2 - 2 = 0]. In other words, for the values 0 through 1 inclusive, 2 of the 14 possibilities (this iteration will always exit since the value must be greater than or equal to zero.
In retrospect, that second solution is a bit of a nightmare. In my personal opinion, I'd go for the first solution for its simplicity and to avoid the possibility of rather large numbers.
Building on my comment posted as a reply to the OP:
I'd consider it an exercise in base conversion. You're simply generating a "random number" in "base 26", where a=0 and z=25. For a random string of length n, generate a number between 1 and 26^n. Convert from base 10 to base 26, using symbols from your chosen alphabet.
Here's a PHP implementation. I won't guaranty that there isn't an off-by-one error or two in here, but any such error should be minor:
<?php
$n = 5;
var_dump(randstr($n));
function randstr($maxlen) {
        $dict = 'abcdefghijklmnopqrstuvwxyz';
        $rand = rand(0, pow(strlen($dict), $maxlen));
        $str = base_convert($rand, 10, 26);
        //base convert returns base 26 using 0-9 and 15 letters a-p(?)
        //we must convert those to our own set of symbols
        return strtr($str, '1234567890abcdefghijklmnopqrstuvwxyz', $dict);
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With