I want to sample 140 numbers between 1000 to 100000 such that the sum of these 140 numbers is around 2 million (2000000):
sample(1000:100000,140)
such that:
sum(sample(1000:100000,140)) = 2000000
Any pointers how I can achieve this?
There are four primary, random (probability) sampling methods – simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
One way of obtaining a random sample is to give each individual in a population a number, and then use a table of random numbers to decide which individuals to include.
Method of lottery Using the lottery method is one of the oldest ways and is a mechanical example of random sampling. In this method, the researcher gives each member of the population a number. Researchers draw numbers from the box randomly to choose samples.
An example of a simple random sample would be the names of 25 employees being chosen out of a hat from a company of 250 employees. In this case, the population is all 250 employees, and the sample is random because each employee has an equal chance of being chosen.
There exist an algorithm for generating such random numbers.
Originally created for MATLAB, there is an R implementation of it:
Surrogate::RandVec
Citation from MATLAB script comment:
% This generates an n by m array x, each of whose m columns
% contains n random values lying in the interval [a,b], but
% subject to the condition that their sum be equal to s. The
% scalar value s must accordingly satisfy n*a <= s <= n*b. The
% distribution of values is uniform in the sense that it has the
% conditional probability distribution of a uniform distribution
% over the whole n-cube, given that the sum of the x's is s.
%
% The scalar v, if requested, returns with the total
% n-1 dimensional volume (content) of the subset satisfying
% this condition. Consequently if v, considered as a function
% of s and divided by sqrt(n), is integrated with respect to s
% from s = a to s = b, the result would necessarily be the
% n-dimensional volume of the whole cube, namely (b-a)^n.
%
% This algorithm does no "rejecting" on the sets of x's it
% obtains. It is designed to generate only those that satisfy all
% the above conditions and to do so with a uniform distribution.
% It accomplishes this by decomposing the space of all possible x
% sets (columns) into n-1 dimensional simplexes. (Line segments,
% triangles, and tetrahedra, are one-, two-, and three-dimensional
% examples of simplexes, respectively.) It makes use of three
% different sets of 'rand' variables, one to locate values
% uniformly within each type of simplex, another to randomly
% select representatives of each different type of simplex in
% proportion to their volume, and a third to perform random
% permutations to provide an even distribution of simplex choices
% among like types. For example, with n equal to 3 and s set at,
% say, 40% of the way from a towards b, there will be 2 different
% types of simplex, in this case triangles, each with its own
% area, and 6 different versions of each from permutations, for
% a total of 12 triangles, and these all fit together to form a
% particular planar non-regular hexagon in 3 dimensions, with v
% returned set equal to the hexagon's area.
%
% Roger Stafford - Jan. 19, 2006
Example:
test <- Surrogate::RandVec(a=1000, b=100000, s=2000000, n=140, m=1, Seed=sample(1:1000, size = 1))
sum(test$RandVecOutput)
# 2000000
hist(test$RandVecOutput)
Here is a hit and miss approach. The basic idea is that finding 140 numbers which sum to 2000000 is the same as breaking 1:2000000 into 140 pieces, which requires 139 cutpoints. Also, note that the minimum of 1000 is somewhat annoying. Just subtract it from all the problem data and add it back in at the end:
rand.nums <- function(a,b,n,k){
#finds n random integers in range a:b which sum to k
while(TRUE){
x <- sample(1:(k - n*a),n-1, replace = TRUE) #cutpoints
x <- sort(x)
x <- c(x,k-n*a) - c(0,x)
if(max(x) <= b-a) return(a+x)
}
}
Then rand.nums(1000,100000,140,2000000)
evaluates to 140 integers in the given range which sum to 2000000. For these choices of parameters, the function returns almost instantly. For other choices of the parameters, a solution might be either impossible or so tightly constrained that finding one by chance is effectively impossible. Thus, caution needs to be taken in using the function. It could be modified by adding a maxtrials
parameter and returning NA
if maxtrials is exceeded without finding a solution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With