I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value. I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets? For example (these data are just a simplified example): <blockquote> Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5 </blockquote> For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters. To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before. Logic or pseudocode preferred.

Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumeric Algorithms. <pre class="prettyprint"><code>void SampleWithoutReplacement ( int populationSize, // size of set sampling from int sampleSize, // size of each sample vector<int> & samples // output, zero-offset indicies to selected items ) { // Use Knuth's variable names int& n = sampleSize; int& N = populationSize; int t = 0; // total input records dealt with int m = 0; // number of items selected so far double u; while (m < n) { u = GetUniform(); // call a uniform(0,1) random number generator if ( (N - t)*u >= n - m ) { t++; } else { samples[m] = t; t++; m++; } } } </code></pre> There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

Algorithm for sampling without replacement?

Tags:

algorithm

statistics

pseudocode

I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.

I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?

For example (these data are just a simplified example):

Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5

For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.

To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before.

Logic or pseudocode preferred.

433

asked Nov 22 '08 19:11

Argalatyr

1 Answers

Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumeric Algorithms.

Click to copy

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

179

answered Sep 18 '22 06:09

John D. Cook

Related questions
                            
                                For inputs of size n, for which values of n does insertion-sort beat merge-sort? [closed]
                            
                                How to generate a unique hash for a URL?
                            
                                How does language detection work?
                            
                                Print the biggest K elements in a given heap in O(K*log(K))?
                            
                                Generating random number between [-1, 1] in C?
                            
                                Number which occurs only once in the array [duplicate]
                            
                                How to check if a Graph is a Planar Graph or not?
                            
                                Algorithm to convert a multi-dimensional array to a one-dimensional array
                            
                                Usage examples of greedy algorithms?
                            
                                Probability of 64bit Hash Code Collisions
                            
                                Efficiently reverse the order of the words (not characters) in an array of characters
                            
                                How does the Levenberg–Marquardt algorithm work in detail but in an understandable way?
                            
                                How to implement a breadth first search to a certain depth?
                            
                                How do I convert decimal numbers to binary in Perl?
                            
                                Color gradient algorithm
                            
                                Efficiently find whether a string contains a group of characters (like substring but ignoring order)?
                            
                                Algorithm to check if directed graph is strongly connected
                            
                                How do I figure out the least number of characters to create a palindrome?
                            
                                How to "smart resize" a displayed image to original aspect ratio
                            
                                Very simple text classification by machine learning? [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With