Algorithm for sampling without replacement?

I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.
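As a rough illustration of that procedure, here is a minimal C++ sketch. It assumes the values are stored group by group in one array, and it uses a within-group sum of squares as the clustering metric. The metric and the names `WithinGroupSS` and `MonteCarloPValue` are placeholders for illustration, not part of the original question; any metric could be swapped in.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical clustering metric: sum of squared deviations of each value
// from its group mean (smaller means tighter clusters). Values are laid out
// group by group, with groupSizes giving the length of each run.
double WithinGroupSS(const std::vector<double>& values,
                     const std::vector<std::size_t>& groupSizes)
{
    double total = 0.0;
    std::size_t start = 0;
    for (std::size_t size : groupSizes)
    {
        double mean = 0.0;
        for (std::size_t i = start; i < start + size; ++i) mean += values[i];
        mean /= static_cast<double>(size);
        for (std::size_t i = start; i < start + size; ++i)
            total += (values[i] - mean) * (values[i] - mean);
        start += size;
    }
    return total;
}

// Estimate the p value as the fraction of shuffled replicates whose metric
// is at least as extreme (here: at least as small) as the observed one.
double MonteCarloPValue(std::vector<double> values,
                        const std::vector<std::size_t>& groupSizes,
                        int replicates, std::mt19937& rng)
{
    const double observed = WithinGroupSS(values, groupSizes);
    int asExtreme = 0;
    for (int r = 0; r < replicates; ++r)
    {
        // Shuffling the values reassigns them to the fixed-size groups.
        std::shuffle(values.begin(), values.end(), rng);
        if (WithinGroupSS(values, groupSizes) <= observed) ++asExtreme;
    }
    // Add-one correction: the observed arrangement counts as one replicate.
    return (asExtreme + 1.0) / (replicates + 1.0);
}
```

With the example data below, a call might look like `MonteCarloPValue(data, {3, 3, 2, 4}, 10000, rng)`, where `rng` is a seeded `std::mt19937`.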

I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?

For example (these data are just a simplified example):

Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5

For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.

To do this, I could generate a random number in the range 1-12 and assign the chosen value to the first element of group A, then generate a random number in the range 1-11 (over the values not yet taken) and assign the second element of group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but sampling without replacement seems like a problem that must have been solved many times before.
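The shrinking-range idea described above is essentially a Fisher-Yates shuffle. The sketch below is one way it could look in C++; the function names `RandomPermutation` and `AssignGroups` are made up for illustration, and the choice of `std::mt19937` as the generator is an assumption.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Draw every index 0..n-1 exactly once, in random order (Fisher-Yates).
// Step i picks uniformly among the n - i indices not yet drawn, which is the
// "range 1-12, then 1-11, ..." idea done in place with one swap per draw.
std::vector<std::size_t> RandomPermutation(std::size_t n, std::mt19937& rng)
{
    std::vector<std::size_t> order(n);
    for (std::size_t i = 0; i < n; ++i) order[i] = i;

    for (std::size_t i = 0; i + 1 < n; ++i)
    {
        std::uniform_int_distribution<std::size_t> pick(i, n - 1);
        std::swap(order[i], order[pick(rng)]);
    }
    return order;
}

// Hypothetical helper: returns groupOf, where groupOf[k] is the group that
// data value k belongs to in this replicate. Group sizes are preserved.
std::vector<int> AssignGroups(const std::vector<std::size_t>& groupSizes,
                              std::mt19937& rng)
{
    std::size_t n = 0;
    for (std::size_t size : groupSizes) n += size;

    const std::vector<std::size_t> order = RandomPermutation(n, rng);
    std::vector<int> groupOf(n);
    std::size_t next = 0;
    for (std::size_t g = 0; g < groupSizes.size(); ++g)
        for (std::size_t k = 0; k < groupSizes[g]; ++k)
            groupOf[order[next++]] = static_cast<int>(g);
    return groupOf;
}
```

For the example above, `AssignGroups({3, 3, 2, 4}, rng)` would give a 12-element vector with three 0s, three 1s, two 2s, and four 3s in random positions. Each replicate costs one pass over the data and needs no bookkeeping of which values are already taken.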

Logic or pseudocode preferred.

Argalatyr asked Nov 22 '08 19:11



1 Answer

Here's some code for sampling without replacement, based on Algorithm S in Section 3.4.2 of Knuth's book Seminumerical Algorithms.

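#include <random>
#include <vector>

using std::vector;

// Uniform random number in [0, 1). Any generator with that range will do;
// this definition is only a placeholder so that the code compiles.
double GetUniform()
{
    static std::mt19937 gen{std::random_device{}()};
    static std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen);
}

// Requires sampleSize <= populationSize.
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indices to selected items,
                           // produced in increasing order
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    samples.resize(n);     // make room for the n selected indices

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far

    while (m < n)
    {
        double u = GetUniform(); // uniform on [0, 1); must never return 1.0

        if ( (N - t)*u >= n - m )
        {
            t++;            // skip record t
        }
        else
        {
            samples[m] = t; // select record t
            t++; m++;
        }
    }
}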

There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

John D. Cook answered Sep 18 '22 06:09