I need to generate n percentages (integers between 0 and 100) such that the sum of all n numbers adds up to 100.
If I just do nextInt() n times, each time ensuring that the parameter is 100 minus the previously accumulated sum, then my percentages are biased (i.e. the first generated number will usually be the largest, etc.). How do I do this in an unbiased way?
A couple of answers suggest picking random percents and taking the differences between them. As Nikita Ryback points out, this will not give the uniform distribution over all possibilities; in particular, zeroes will be less frequent than expected.
To fix this, think of starting with 100 'percents' and inserting dividers. I will show an example with 10:
% % % % % % % % % %
There are eleven places we could insert a divider: between any two percents or at the beginning or end. So insert one:
% % % % / % % % % % %
This represents choosing four and six. Now insert another divider. This time, there are twelve places, because the divider already inserted creates an extra one. In particular, there are two ways to get
% % % % / / % % % % % %
either inserting before or after the previous divider. You can continue the process until you have as many dividers as you need (one fewer than the number of percentages you want to generate).
% % / % / % / / % % % / % % % /
This corresponds to 2,1,1,0,3,3,0.
We can prove that this gives the uniform distribution. The number of compositions of 100 into $k$ parts (zeros allowed) is the binomial coefficient $\binom{100+k-1}{k-1}$, that is,

$$\binom{100+k-1}{k-1} = \frac{(100+k-1)(100+k-2)\cdots 101}{(k-1)(k-2)\cdots 2\cdot 1}.$$

Thus the probability of choosing any particular composition should be the reciprocal of this. As we insert dividers one at a time, first we choose from 101 positions, then 102, 103, etc. until we get to $100+k-1$. So the probability of any particular sequence of insertions is

$$\frac{1}{(100+k-1)(100+k-2)\cdots 101}.$$

How many insertion sequences give rise to the same composition? The final composition contains $k-1$ dividers. They could have been inserted in any order, so there are $(k-1)!$ sequences that give rise to a given composition. Multiplying the probability of one sequence by $(k-1)!$ gives exactly the reciprocal of the binomial coefficient, so the probability of any particular composition is exactly what it should be.
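As a sanity check, the same counting can be worked through on the small example above, with 10 units instead of 100 and, as an assumed illustration, $k = 3$ parts (two dividers):

```latex
\begin{align*}
\text{compositions of 10 into 3 parts:}\quad & \binom{10+3-1}{3-1} = \binom{12}{2} = 66 \\
\text{insertion sequences:}\quad & 11 \cdot 12 = 132 \\
\text{sequences per composition:}\quad & (3-1)! = 2 \\
\text{probability of one composition:}\quad & \frac{2}{132} = \frac{1}{66}
\end{align*}
```

Each of the 66 compositions is therefore reached with the same probability, which matches the general argument.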
In actual code, you probably wouldn't represent your steps like this. You should be able to just hold on to numbers, rather than sequences of percents and dividers. I haven't thought about the complexity of this algorithm.
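For illustration, here is a minimal Java sketch of the divider-insertion procedure described above. It follows the description literally and keeps an explicit list of units and dividers rather than the more compact number-only representation hinted at; the class name DividerPercentages, the method generate and its parameters are hypothetical names chosen for this example, not something from the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class DividerPercentages {

    // Returns n non-negative integers that sum to total, built by
    // inserting n - 1 dividers one at a time into a row of `total` units.
    static int[] generate(int n, int total, Random rng) {
        if (n < 1 || total < 0) {
            throw new IllegalArgumentException("need n >= 1 and total >= 0");
        }

        // false = one unit ("%"), true = a divider ("/")
        List<Boolean> tokens = new ArrayList<>(total + n);
        for (int i = 0; i < total; i++) {
            tokens.add(false);
        }

        // A row of m tokens has m + 1 insertion slots (before, between, after),
        // so the first divider has total + 1 choices, the next total + 2, etc.
        for (int d = 0; d < n - 1; d++) {
            int slot = rng.nextInt(tokens.size() + 1);
            tokens.add(slot, true);
        }

        // Count the units between consecutive dividers.
        int[] parts = new int[n];
        int index = 0;
        for (boolean isDivider : tokens) {
            if (isDivider) {
                index++;
            } else {
                parts[index]++;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Example: 7 percentages summing to 100.
        System.out.println(Arrays.toString(generate(7, 100, new Random())));
    }
}
```

Each insertion into the ArrayList shifts elements, so this sketch costs roughly O(n * (total + n)) time, which is unlikely to matter for a total of 100.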
Generate n random integers with any range (call them a[1]..a[n]). Sum up your integers and call that b. Your percentages will be [100*a[1]/b, ..., 100*a[n]/b].
Edit: good points, rounding the results to total exactly 100 is non-trivial. One approach would be to take the floor of 100*a[x]/b for x in 1..n as your integers, then distribute the remaining units, 100-(sum of integers), randomly. I'm not sure if this would introduce any bias into the result.
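The normalize-and-round idea could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name NormalizedPercentages, the method generate, and the choice of drawing each a[x] from 1..1000 are all made up for the example (the lower bound of 1 is there to keep b from being zero). The sketch makes no claim about how the resulting percentages are distributed.

```java
import java.util.Random;

class NormalizedPercentages {

    // Returns n non-negative integers that sum to 100 by normalizing
    // n random positive integers and handing out the rounding leftovers.
    static int[] generate(int n, Random rng) {
        if (n < 1) {
            throw new IllegalArgumentException("need n >= 1");
        }

        // Draw a[0..n-1]; the range 1..1000 is an arbitrary assumption.
        long[] a = new long[n];
        long b = 0;
        for (int i = 0; i < n; i++) {
            a[i] = rng.nextInt(1000) + 1;
            b += a[i];
        }

        // Integer division takes the floor of 100 * a[i] / b.
        int[] percentages = new int[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            percentages[i] = (int) (100 * a[i] / b);
            assigned += percentages[i];
        }

        // The floors sum to at most 100; give each leftover unit
        // to a randomly chosen entry.
        for (int left = 100 - assigned; left > 0; left--) {
            percentages[rng.nextInt(n)]++;
        }
        return percentages;
    }
}
```

Calling NormalizedPercentages.generate(5, new Random()) returns five integers that total 100. Whether this matches the uniform distribution over all possibilities is a separate question that the sketch does not settle.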