Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create constrained random numbers?

CLEANED UP TEXT:

How can I create m=5 random numbers that add upp to, say n=100. But, the first random number is say, 10 < x1 < 30, the second random nr is 5 < x2 < 20, the third random nr is 10 < x3 < 25, etc. So these five random numbers add up to 100. How can I create these constrained five numbers?

.

[[

Related problem A1): The standard way to create five random numbers that add up to 100, is to sample four numbers between [0,100], and add the boundaries 0 and 100, and then sort these six numbers [0,x1,x2,x3,x4,100]. The five random numbers I seek, are the deltas. That is,

100 - x[4] = delta 5
x[4]- x[3] = delta 4
x[3]- x[2] = delta 3
x[2]- x[1] = delta 2
x[1] - 0   = delta 1

These five deltas will now add up to 100. For instance, they might be 0,1,2,7,90. Here is some code that solves this problem:

total_sum = 100
n = 5
v = numpy.random.multinomial(total_sum, numpy.ones(n)/n)

]]

.

For my problem, I can not allow wide intervals to occur, the largest spread above is 90-7 = 83 which is too wide. So, I have to specify a tighter spread, say [10,30]. This means the largest random number is 30, which disallows large spreads such as 83.

.

[[

Related problem A2): A partial solution to create five numbers with identical boundaries, 10 < x_i < 30, that adds up to 100 is like this: Just do like in A1) but add the lower boundary 10, to the deltas. So I get the five random numbers that I seek like this:

100 - x[4] = delta 5 + 10
x[4]- x[3] = delta 4 + 10
x[3]- x[2] = delta 3 + 10
x[2]- x[1] = delta 2 + 10
x[1] - 0   = delta 1 + 10

Basically, I do exactly like in A1) but do not start from 0, but start from 10. Thus, each number has the lower boundary 10, but they dont have an upper boundary, it can be large, too large. How to limit the upper boundary to 30? Here the problem is how to limit the upper boundary

]]

.

To recapitulate, the type of the problem I try to solve looks like this: I need five random numbers adding up to 100 and I need to specify the boundaries separately for each number, say [10,30] for the first random number, and then [5,10] for the second random number, and [15,35] for the third random number, etc. And they must all add up to 100.

But the real data I am using, has ~100 numbers x_i (m=50), all of them adding up to say ~400,000. And the range is typically [3000,5000] for a number x_i. These numbers are not really accurate, I am only trying to convey something about the problem size. The purpose is to do a MCMC simulation so these numbers need to be quickly generated. People have suggested very elegant solutions that really do work, but they take too long time, so I can not use them. The problem is still unsolved. Ideally I would like an O(m) solution and O(1) memory solution.

This problem should not be NP-hard, it doesnt feel like it. There should be a polynomial time solution, right?

like image 289
Orvar Korvar Avatar asked Aug 26 '13 16:08

Orvar Korvar


2 Answers

Suppose you need n_1 in [10,30], n_2 in [20,40], n_3 in [30,50] and n1+n2+n3=90

If you need each possible triplet (n_1, n_2, n_3) to be equally-likely, that's going to be difficult. The number of triples of the form (20, n_2, n_3) is greater than the number of triples (10, n_2, n_3), so you can't just pick n_1 uniformly.

The incredibly slow but accurate way is to generate the all 5 randoms in the correct ranges and reject the whole group if the sum is not correct.

. . . AHA!

I found a way to parametrize the choice effectively. First, though, for simplicity note that the sum of the low bounds is the minimum possible sum. If subtract the sum of the low bounds from the target number and subtract the low bound from each generated number, you get a problem where each number is in the interval [0, max_k-min_k]. That simplifies the math and array (list) handling. Let n_k be the 0-based choice with 0<=n_k<=max_k-min_k.

The order of the sums is lexicographic, with all sums beginning with n_1=0 (if any) first, then n_1==1 sums, etc. Sums are sorted by n_2 in each of those groups, then by n_3, and so on. If you know how many sums add to the target (call that T), and how many sums start with n_1=0, 1, 2, ... then you can find the starting number n1 of sum number S in in that list. Then you can reduce the problem to adding n_2+n_3+... to get T-n_1, finding sum number S - (number original sums starting with number less than n_1).

Let pulse(n) be a list of n+1 ones: (n+1)*[1] in Python terms. Let max_k,min_k be the limits for the k'th choice, and m_k = max_k-min_k be the upper limit for 0-based choices. Then there are 1+m_1 different "sums" from the choice of the first number, and pulse(m_k) gives the distribution: 1 was to make each sum from 0 to m_1. For the first two choices, there are m_1+m_+1 different sums. It turns out that the convolution of pulse(m_1) with pulse(m_2) gives the distribution.

Time to stop for some code:

    def pulse(width, value=1):
        ''' Returns a vector of (width+1) integer ones. '''
        return (width+1)*[value]

    def stepconv(vector, width):
        ''' Computes the discrete convolution of vector with a "unit"
            pulse of given width.

            Formula: result[i] = Sum[j=0 to width] 1*vector[i-j]
            Where 0 <= i <= len(vector)+width-1, and the "1*" is the value
            of the implied unit pulse function: pulse[j] = 1 for 0<=j<=width.
        '''
        result = width*[0] + vector;
        for i in range(len(vector)):
            result[i] = sum(result[i:i+width+1])
        for i in range(len(vector), len(result)):
            result[i] = sum(result[i:])
        return result

That's coded specifically for only doing convolutions with a "pulse" array, so every linear combination in the convolution is just a sum.

Those are used only in the constructor of the final class solution:

class ConstrainedRandom(object):
    def __init__(self, ranges=None, target=None, seed=None):
        self._rand = random.Random(seed)
        if ranges != None: self.setrange(ranges)
        if target != None: self.settarget(target)

    def setrange(self, ranges):
        self._ranges = ranges
        self._nranges = len(self._ranges)
        self._nmin, self._nmax = zip(*self._ranges)
        self._minsum = sum(self._nmin)
        self._maxsum = sum(self._nmax)
        self._zmax = [y-x for x,y in self._ranges]
        self._rconv = self._nranges * [None]
        self._rconv[-1] = pulse(self._zmax[-1])
        for k in range(self._nranges-1, 0, -1):
            self._rconv[k-1] = stepconv(self._rconv[k], self._zmax[k-1])

    def settarget(self, target):
        self._target = target

    def next(self, target=None):
        k = target if target != None else self._target
        k = k - self._minsum;
        N = self._rconv[0][k]
        seq = self._rand.randint(0,N-1)
        result = self._nranges*[0]
        for i in range(len(result)-1):
            cv = self._rconv[i+1]
            r_i = 0
            while k >= len(cv):
                r_i += 1
                k -= 1
            while cv[k] <= seq:
                seq -= cv[k]
                r_i += 1
                k -= 1
            result[i] = r_i
        result[-1] = k # t
        return [x+y for x,y in zip(result, self._nmin)]
    
    # end clss ConstrainedRandom

Use that with:

ranges = [(low, high), (low, high), ...]
cr = ConstrainedRandom(ranges, target)
seq = cr.next();
print(seq)
assert sum(seq)==target

seq = cr.next(); # get then get the next one.

...etc. The class could be trimmed down a bit, but the main space overhead is in the _rconv list, which has the stored convolutions. That's roughly N*T/2, for O(NT) storage.

The convolutions only use the ranges, with a lot of randoms generated with the same constraints, the table construction time "amortizes away" to zero. The time complexity of .next() is roughly T/2 on average and O(T), in terms of the number of indexes into the _rconv lists.


To see how the algorithm works, assume a sequence of 3 zero-based choices, with max values (5,7,3), and a 0-based target T=10. Define or import the pulse and stepconv functions in an Idle session, then:

>>> pulse(5)
[1, 1, 1, 1, 1, 1]
>>> K1 = pulse (5)
>>> K2 = stepconv(K1, 7)
>>> K3 = stepconv(K2, 3)
>>> K1
[1, 1, 1, 1, 1, 1]
>>> K2
[1, 2, 3, 4, 5, 6, 6, 6, 5, 4, 3, 2, 1]
>>> K3
[1, 3, 6, 10, 14, 18, 21, 23, 23, 21, 18, 14, 10, 6, 3, 1]
>>> K3[10]
18
>>> sum(K3)
192
>>> (5+1)*(7+1)*(3+1)
192

K3[i] shows the number of different choice n_1, n_2, n_3 such that 0 <= n_k <= m_k and Σ n_k = i. Letting * mean convolution when applied to two of these lists. Then pulse(m_2)*pulse(m_3) is gives the distribution of sums of n_2 and n_3:

>>> R23 = stepconv(pulse(7),3)
>>> R23
[1, 2, 3, 4, 4, 4, 4, 4, 3, 2, 1]
>>> len(R23)
11

Every value from 0 to T=10 is (barely) possible, so any choice is possible for the first number and there are R23[T-n_1] possible triplets adding to T=10 that start with N1. So, once you've found that there are 18 possible sums adding to 10, generate a random number S = randint(18) and count down through the R23[T:T-m_1-1:-1] array:

>>> R23[10:10-5-1:-1]
[1, 2, 3, 4, 4, 4]
>>> sum(R23[10:10-5-1:-1])
18

Note the sum of that list is the total computed in K3[10] above. A sanity check. Anyway, if S==9 was the random choice, then find how many leading terms of that array can be summed without exceeding S. That's the value of n_1. In this case 1+2+3 <= S but 1+2+3+4 > S, so n_1 is 3.

As described above, you can then reduce the problem to find n_2. The final number (n_3 in this example) will be uniquely determined.

like image 190
Mike Housky Avatar answered Sep 22 '22 00:09

Mike Housky


First, define your ranges:

ranges = [range(11,30), range(6,20), range(11,25)]

Then enumerate all possibilities:

def constrainedRandoms(ranges):
    answer = []
    for vector in itertools.product(*ranges):
        if sum(ranges) == 100:
            answer.append(vector)
    return answer

Equivalent one-liner:

answer = [v for v in itertools.product(*ranges) if sum(v)==100]

Then pick a random element from them:

myChoice = random.choice(answer)

EDIT:

The cartesian product (computed with itertools.product) itself doesn't eat much RAM (this is because itertools.product returns a generator, which uses O(1) space, but a lot of time as you pointed out). Only computing the list (answer) does. The bad news is that in order to use random.choice, you do need a list. If you really don't want to use a list, then you might need to come up with a decaying probability function. Here's a very simple probability function that you could use:

def constrainedRandoms(ranges):
    choices = (v for v in itertools.product(*ranges) if sum(v)==100) # note the parentheses. This is now a generator as well

    prob = 0.5
    decayFactor = 0.97 # set this parameter yourself, to better suit your needs
    try:
        answer = next(choices)
    except StopIteration:
        answer = None
    for choice in choices:
        answer = choice
        if random.uniform(0,1) >= prob:
            return answer
        prob *= decayFactor
    return answer

The decaying probability allows you increase the probability with which you will select the next vector that fits your constraints. Think about it this way: you have a bunch of constraints. Chances are, that you'll have a relatively small number of vectors that satisfy these constraints. This means that every time you decide not to use a vector, the probability that there is another such vector decreases. Therefore, over time, you need to be progressively more biased towards returning the current vector as "the randomly selected vector". Of course, it is still possible to go through the entire loop structure without ever returning a vector. This is why the code starts with a default of the first vector that fits the constraints (assuming one exists) and returns the last such vector if the decaying probability never allows for a vector to be returned.
Note that this decaying probability idea allows you to not have to iterate through all candidate vectors. With time, the probability of the code returning the current vector under consideration increases, thus reducing the probability of it continuing on and considering other vectors. This idea therefore helps mitigate your concerns about the runtime as well (though, I'm embarrassed to add, not by very much)

like image 38
inspectorG4dget Avatar answered Sep 18 '22 00:09

inspectorG4dget