Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sample with a max

Tags:

r

sampling

If I want to sample numbers to create a vector I do:

set.seed(123)
x <- sample(1:100,200, replace = TRUE)
sum(x)
# [1] 10228

What if I want to sample 20 random numbers that sum to 100, and then 30 numbers but still sum to 100. This I imagine will be more of a challenge than it seems. ?sample and searching Google has not provided me with a clue. And a loop to sample then reject if not close enough( e.g. within 5) of the desired sum I guess may take some time.

Is there a better way to achieve this?

an example would be:

foo(10,100) # ten random numbers that sum to 100. (not including zeros)
# 10,10,20,7,8,9,4,10,2,20
like image 749
user1320502 Avatar asked Feb 04 '13 10:02

user1320502


People also ask

Is sample maximum a statistic?

The sample maximum and minimum are basic summary statistics, showing the most extreme observations, and are used in the five-number summary and a version of the seven-number summary and the associated box plot.

What is the minimum value for a large sample?

Most statisticians agree that the minimum sample size to get any kind of meaningful result is 100. If your population is less than 100 then you really need to survey all of them.

What is Max in statistics?

The maximum is the largest value in the data set.

Is sample size a statistic?

Sure, sample size is a statistic. A “parameter” is a knob you turn to get some distribution to behave a certain way.


1 Answers

An attempt using R

# Config
n <- 20L
target <- 100L
vec <- seq(100)
set.seed(123)

# R repeat loop
sumto_repeat <- function(vec,n,target) {
  res <- integer()
  repeat {
    cat("begin:",sum(res),length(res),"\n")
    res <- c( res, sample(vec,1) )
    if( sum(res)<target & length(res)==(n-1) ) {
      res[length(res)+1] <- target - sum(res)
    }
    # cat("mid:",sum(res),length(res),"\n")
    if(sum(res)>target) res <- res[-length(res)]
    if( length(res)>n | length(res)<n & sum(res)==target ) {
      res <- res[-sample(seq(length(res)),1)]
    }
    # cat("end:",sum(res),length(res),"\n")
    # cat(dput(res),"\n")
    if( sum(res)==target & length(res)==n ) break
  }
  res
}

test <- sumto_repeat(vec=vec,n=n,target=target)
> sum(test)
[1] 100
> length(test)
[1] 20

Also, I'd give some thought to what distribution you'd like to be drawing from. I think that there are a few different ways of getting it to sum to exactly target with n elements (for instance, you could make the last element always be target - sum(res)) that may or may not have different distributional implications.

A very similar algorithm in Rcpp, for speeeeed!

cpp_src <- '
Rcpp::IntegerVector xa = clone(x); // Vector to be sampled
Rcpp::IntegerVector na(n); // Number of elements in solution
Rcpp::IntegerVector sa(s); // Sum of solution

int nsampled;
int currentSum;
int dropRandomIndex;
int numZeroes;
Rcpp::IntegerVector remainingQuantity(1);
int maxAttempts = 100;

// Create container for our results
Rcpp::IntegerVector res(maxAttempts);
std::fill( res.begin(), res.end(), NA_INTEGER );

// Calculate min/max so that we can draw random integers from within range
Rcpp::IntegerVector::iterator mn = std::min_element(xa.begin(), xa.end()) ;
Rcpp::IntegerVector::iterator mx = std::max_element(xa.begin(), xa.end()) ;
std::cout << "mx = " << *mx << std::endl;

// Now draw repeatedly
nsampled = 0;
for( int i = 0; i < maxAttempts; i++ ) {
  std::cout << "\\n" << i;
  int r = *mn + (rand() % (int)(*mx - *mn + 1));
  res[i] = xa[r+1];
  // Calculate n and s for current loop iteration
  numZeroes = 0;
  for( int j = 0; j < maxAttempts; j++) 
    if(res[j]==0) numZeroes++;
  std::cout << " nz= " << numZeroes ;
  nsampled = maxAttempts - sum( is_na(res) ) - numZeroes - 1;
  currentSum = std::accumulate(res.begin(),res.begin()+i,0); // Cant just use Rcpp sugar sum() here because it freaks at the NAs
  std::cout << " nsamp= " << nsampled << " sum= " << currentSum;
  if(nsampled == na[0]-1) {  
    std::cout << " One element away. ";
    remainingQuantity[0] = sa[0] - currentSum;
    std::cout << "remainingQuantity = " << remainingQuantity[0];
    if( (remainingQuantity[0] > 0) && (remainingQuantity[0]) < *mx ) {
      std::cout << "Within range.  Prepare the secret (cheating) weapon!\\n";
      std::cout << sa[0] << " ";
      std::cout << currentSum << " ";
      std::cout << remainingQuantity[0] << std::endl;
      if( i != maxAttempts ) {
        std::cout << "Safe to add one last element on the end.  Doing so.\\n";
        res[i] = remainingQuantity[0];
      }
      currentSum = sa[0];
      nsampled++;
      if(nsampled == na[0] && currentSum == sa[0]) std::cout << "It should end after this...nsamp= " << nsampled << " and currentSum= " << currentSum << std::endl;
      break;
    } else {
      std::cout << "Out of striking distance.  Dropping random element\\n";
      dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
      res[dropRandomIndex] = 0;
    }
  }
  if(nsampled == na[0] && currentSum == sa[0]) {
      std::cout << "Success!\\n";
      for(int l = 0; l <= i+1; l++) 
        std::cout << res[l] << " " ;
      break;
  }
  if(nsampled == na[0] && currentSum != sa[0]) {
    std::cout << "Reached number of elements but sum is ";
    if(currentSum > sa[0]) {
      std::cout << "Too high. Blitz everything and start over!\\n";
      for(int k = 0; k < res.size(); k++) {
        res[k] = NA_INTEGER;
      }
    } else {
      std::cout << "Too low.  \\n";

    }
  }
  if( nsampled < na[0] && currentSum >= sa[0] ) {
    std::cout << "Too few elements but at or above the sum cutoff.  Dropping a random element and trying again.\\n";
    dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
    res[dropRandomIndex] = 0;
  }
}
return res;
'

sumto <- cxxfunction( signature(x="integer", n="integer", s="integer"), body=cpp_src, plugin="Rcpp", verbose=TRUE )

testresult <- sumto(x=x, n=20L, s=1000L)
testresult <- testresult[!is.na(testresult)]
testresult <- testresult[testresult!=0]
testresult
cumsum(testresult)
length(testresult)

Tried it with a few different values, and produces valid answers unless it runs away. There's a caveat here, which is that it cheats if it's one away from the desired number of elements and within "striking distance" -- e.g. rather than just drawing the last value it calculates it if that number is valid.

Benchmarks

See gist for comparison code.

benchmarks

like image 52
Ari B. Friedman Avatar answered Sep 28 '22 00:09

Ari B. Friedman