
Generate N random numbers within a range with a constant sum

Tags: c++, algorithm, random, range, sum

I want to generate N random numbers drawn from a specific distribution (e.g. uniform) in the range [a,b] that sum to a constant C. I have tried a couple of solutions I came up with myself, and some proposed in similar threads, but most of them either work only for a limited form of the problem or I can't prove that the outcome still follows the desired distribution.

What I have tried:

Generate N random numbers, divide each of them by their sum and multiply by the desired constant. This seems to work, but the results do not respect the rule that the numbers must lie within [a,b].

Generate N-1 random numbers, add 0 and the desired constant C, and sort them. Then compute the difference between each pair of consecutive numbers; these differences are the result. This again sums to C but has the same problem as the previous method (the values can fall outside [a,b]).
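For reference, the sorted-differences approach described above can be sketched as follows. This is only an illustration for real-valued output; the function name `sortedGaps` and the use of `std::mt19937` are my own choices, not part of the original attempt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: splits [0, C] at n-1 uniform cut points and
// returns the n gap lengths. The gaps sum to C, but nothing keeps
// each gap inside a prescribed range [a, b].
std::vector<double> sortedGaps(std::size_t n, double C, std::mt19937 &rng) {
    assert(n > 0 && C > 0.0);
    std::uniform_real_distribution<double> unif(0.0, C);
    std::vector<double> cuts{0.0, C};            // the two fixed end points
    for (std::size_t i = 0; i + 1 < n; ++i)
        cuts.push_back(unif(rng));               // n-1 random cut points
    std::sort(cuts.begin(), cuts.end());
    std::vector<double> parts(n);
    for (std::size_t i = 0; i < n; ++i)
        parts[i] = cuts[i + 1] - cuts[i];        // consecutive differences
    return parts;
}
```

Calling `sortedGaps(10, 50.0, rng)` returns ten non-negative values whose total is 50 up to rounding; individual values may be anywhere between 0 and 50.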

I also tried to generate random numbers while always keeping track of the min and max so that the desired sum and range are preserved, and came up with this code:

#include <functional>
#include <vector>

bool generate(std::function<int(int,int)> randomGenerator, int min, int max, int len, int sum, std::vector<int> &output){
    // Not possible to produce such a sequence
    if(min*len > sum)
        return false;
    if(max*len < sum)
        return false;

    int curSum = 0;
    int left = sum - curSum;              // amount still to be distributed
    int leftIndexes = len-1;              // positions remaining after the current one
    int curMax = left - leftIndexes*min;  // largest value that keeps the rest feasible
    int curMin = left - leftIndexes*max;  // smallest value that keeps the rest feasible

    for(int i=0;i<len;i++){
        // Draw from [min,max] clamped to the feasible window [curMin,curMax]
        int num = randomGenerator((curMin<min)?min:curMin, (curMax>max)?max:curMax);
        output.push_back(num);
        curSum += num;
        left = sum - curSum;
        leftIndexes--;
        curMax = left - leftIndexes*min;
        curMin = left - leftIndexes*max;
    }

    return true;
}

This seems to work but the results are sometimes very skewed and I don't think it's following the original distribution (e.g. uniform). E.g:

//10 numbers within [1:10] which sum to 50:
generate(uniform,1,10,10,50,output);
//result: 2,7,2,5,2,10,5,8,4,5 => sum=50
//This looks reasonable for uniform, but let's change to
//10 numbers within [1:25] which sum to 50:
generate(uniform,1,25,10,50,output);
//result: 24,12,6,2,1,1,1,1,1,1 => sum=50

Notice how many ones appear in the output. This might sound reasonable because the range is larger, but the values really don't look uniformly distributed. I am not even sure whether what I want is achievable; maybe the constraints make the problem unsolvable.

asked Mar 21 '15 19:03

Behrooz

2 Answers

If you want the sample to follow a uniform distribution, the problem reduces to generating N random numbers with sum = 1. This, in turn, is a special case of the Dirichlet distribution, but it can also be computed more easily using the exponential distribution. Here is how:

1. Take a uniform sample v₁ … v_N with all v_i between 0 and 1.
2. For all i, 1<=i<=N, define u_i := -ln v_i (notice that u_i > 0).
3. Normalize the u_i as p_i := u_i/s where s is the sum u₁+...+u_N.

The p₁..p_N are uniformly distributed (in the simplex of dim N-1) and their sum is 1.

You can now multiply these p_i by the constant C you want and translate them by adding another constant A, like this:

q_i := A + p_i*C.
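The three steps and the final rescaling can be sketched in C++ as below. This is a minimal sketch under a few assumptions of mine: the helper names `sampleSimplex` and `scaleAndShift` are hypothetical, `std::mt19937` is just one possible generator, and the code takes -ln(1 - v) instead of -ln(v) so that the argument of the logarithm is never 0 (v is drawn from [0,1), so 1 - v lies in (0,1]).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns n non-negative values that sum to 1,
// obtained by normalizing negative logarithms of uniform variates.
std::vector<double> sampleSimplex(std::size_t n, std::mt19937 &rng) {
    assert(n > 0);
    std::uniform_real_distribution<double> unif(0.0, 1.0); // v in [0, 1)
    std::vector<double> p(n);
    double s = 0.0;
    for (double &u : p) {
        u = -std::log(1.0 - unif(rng)); // step 2; 1 - v is in (0, 1]
        s += u;
    }
    // Step 3. s is zero only if every draw was exactly 0, which is
    // vanishingly unlikely and not handled in this sketch.
    for (double &x : p) x /= s;
    return p;
}

// q_i := A + p_i * C
std::vector<double> scaleAndShift(std::vector<double> p, double A, double C) {
    for (double &x : p) x = A + x * C;
    return p;
}
```

For example, `scaleAndShift(sampleSimplex(5, rng), 2.0, 3.0)` yields five values in [2,5] whose sum is 5*2 + 3 = 13 up to rounding.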

EDIT 3

In order to address some issues raised in the comments, let me add the following:

- To ensure that the final random sequence falls in the interval [a,b], choose the constants A and C above as A := a and C := b-a, i.e., take q_i = a + p_i*(b-a). Since p_i is in the range (0,1), all q_i will be in the range [a,b]. (With this choice the q_i sum to N*a + (b-a).)
- One cannot take the (negative) logarithm -ln(v_i) if v_i happens to be 0, because ln() is not defined at 0. The probability of such an event is extremely low. However, to ensure that no error is signaled, the generation of v₁ ... v_N in item 1 above must treat any occurrence of 0 in a special way: consider -ln(0) as +infinity (remember: ln(x) -> -infinity when x->0). Then the sum s = +infinity, which means that p_i = 1 and all other p_j = 0. Without this convention the sequence (0...1...0) would never be generated (many thanks to @Severin Pappadeux for this interesting remark).
- As explained in the 4th comment attached to the question by @Neil Slater, it is logically impossible to fulfill all the requirements of the original framing. Therefore any solution must relax the constraints to a proper subset of the original ones. Other comments by @Behrooz seem to confirm that this would suffice in this case.

EDIT 2

One more issue has been raised in the comments:

Why does rescaling a uniform sample not suffice?

In other words, why should I bother to take negative logarithms?

The reason is that if we just rescale, the resulting sample won't be uniformly distributed across the segment (0,1) (or [a,b] for the final sample).

To visualize this, let's think in 2D, i.e., consider the case N=2. A uniform sample (v₁,v₂) corresponds to a random point in the square with origin (0,0) and corner (1,1). Now, when we normalize such a point by dividing it by the sum s=v₁+v₂, we are projecting the point onto the diagonal as shown in the picture (keep in mind that the diagonal is the line x + y = 1):

[Figure: points of the unit square projected onto the line x + y = 1 (https://i.stack.imgur.com/Idjhx.png)]

Since the green lines, which are closer to the principal diagonal from (0,0) to (1,1), are longer than the orange ones, which are closer to the x and y axes, the projections tend to accumulate around the center of the projection line (in blue), where the scaled sample lives. This shows that simple scaling won't produce a uniform sample on the depicted diagonal. On the other hand, it can be proven mathematically that the negative logarithms do produce the desired uniformity. So, instead of copy-pasting a mathematical proof, I would invite everyone to implement both algorithms and check that the resulting plots behave as this answer describes.
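One way to run that comparison for N=2 without plotting is to histogram the first normalized coordinate under both methods. The following is a minimal sketch; the function name `histogramOfFirstCoordinate`, the bin count and the use of 1 - v (to keep the logarithm finite) are my assumptions, not part of the answer.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

constexpr std::size_t kBins = 10;

// Histogram of p1 = x1 / (x1 + x2) over `trials` draws.
// useLog == false: x_i = v_i (plain rescaling).
// useLog == true:  x_i = -ln(v_i) (the method from this answer).
std::array<std::size_t, kBins> histogramOfFirstCoordinate(
        bool useLog, std::size_t trials, std::mt19937 &rng) {
    assert(trials > 0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::array<std::size_t, kBins> bins{};
    for (std::size_t t = 0; t < trials; ++t) {
        double x1 = 1.0 - unif(rng); // in (0, 1]
        double x2 = 1.0 - unif(rng);
        if (useLog) {
            x1 = -std::log(x1);
            x2 = -std::log(x2);
        }
        const double s = x1 + x2;
        if (s == 0.0) continue;      // both logs zero; vanishingly unlikely
        const double p1 = x1 / s;
        const std::size_t b = std::min<std::size_t>(
            kBins - 1, static_cast<std::size_t>(p1 * kBins));
        ++bins[b];
    }
    return bins;
}
```

With plain rescaling the counts should pile up in the middle bins (roughly 17% of the draws in each of the two central bins versus about 6% in each outer bin), while with negative logarithms every bin should hold close to 10% of the draws.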

(Note: here is a blog post on this interesting subject with an application to the Oil & Gas industry)

answered Oct 29 '22 11:10

Leandro Caniglia

Let's try to simplify the problem. By subtracting the lower bound, we can reduce it to finding N numbers in [0,b-a] such that their sum is C-Na.

Renaming the parameters, we can look for N numbers in [0,m] whose sum is S.
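As a small illustration of this reduction, the sketch below computes m and S from the original parameters and checks whether any solution can exist at all (0 <= S <= N*m). The names `ReducedProblem`, `reduce` and `isFeasible` are hypothetical.

```cpp
#include <cassert>

// Hypothetical container for the renamed parameters.
struct ReducedProblem {
    double m; // upper bound of the shifted range [0, m]
    double S; // target sum of the shifted values
};

// Shift every value by -a: the range becomes [0, b-a], the sum C - N*a.
ReducedProblem reduce(int N, double a, double b, double C) {
    assert(N > 0 && a <= b);
    return {b - a, C - N * a};
}

// N values in [0, m] can sum to S only if 0 <= S <= N*m.
bool isFeasible(int N, const ReducedProblem &r) {
    return r.S >= 0.0 && r.S <= N * r.m;
}
```

For the question's second example (10 numbers in [1,25] summing to 50) this gives m = 24 and S = 40, which is feasible.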

Now the problem is akin to partitioning a segment of length S into N distinct sub-segments, each with a length in [0,m].

I think the problem is simply not solvable.

If S=1, N=1000 and m is anything above 0, the only possible split (for integer values) is one 1 and 999 zeroes, which is nothing like a random spread.

There is a correlation between N, m and S, and even picking random values will not make it disappear.

For the most uniform split, the lengths of the sub-segments will follow a Gaussian curve with a mean value of S/N.

If you tweak your random numbers differently, you will end up with some other bias, but in the end you will never have both a uniform distribution over [a,b] and a total of C, unless the upper bound b happens to be 2C/N-a, i.e. the mean (a+b)/2 of the uniform distribution equals C/N.

answered Oct 29 '22 11:10

kuroi neko
