Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Randomly assign people into different size groups and category

Tags:

r

I need to assign people randomly into groups and a category. Unfortunately, I don’t really know where to start with this. I have tried to explain my problem using the example below. Any help with this would be very much appreciated.

I have 207 ‘Home Type A’ and 408 ‘Home Type B’ categories. In total there are 1524 people that need to be assigned to either the 207 Home Type A’s or the 408 Home Type B’s categories. However, the 1524 people also need to be grouped together in either groups of 2 to 7 for Home Type A or of 2 to 6 for Home Type B .

The end result should be 1524 people assigned to 207 groups (containing between 2 to 7 people) and 408 groups (containing between 2 to 6 people).

The assignment to groups has to be random and can use any combination of group sizes required as it does not matter if a group category is not used (for example, it would be fine if the 207 groups for Home Type A only contained 2, 3 or 4 people one time, or only 5 and 7 people another).

I imagine an output that would look something like this:

GroupSize <- c(2:7)
Num.Groups <- 0
Num.People <- 0
HouseTypeA <- data.frame(GroupSize, Num.Groups, Num.People)
GroupSize <- c(2:6)
HouseTypeB <- data.frame(GroupSize, Num.Groups, Num.People)

With the 'Num.Groups' column summing to either 207 or 408 and the sum of 'Num.People' between the two data frames being 1524.

like image 603
Chris Avatar asked Apr 22 '16 11:04

Chris


People also ask

How do you randomly assign participants to groups in Excel?

The function RAND () is Excel's random number generator. To use it, in Column C, type in the following = RAND() in each cell adjacent to every name. Or you can type this function in the top row (row 2) and simply copy and paste to the entire column, or click and drag.

How do you randomly assign participants to groups?

How do you randomly assign participants to groups? To implement random assignment, assign a unique number to every member of your study's sample. Then, you can use a random number generator or a lottery method to randomly assign each number to a control or experimental group.


2 Answers

I tried this in 3 steps:

  1. Make List of house type
  2. Distribute each person to houses, checking that there is space (max=7 for type a and max=6 for type b)
  3. Check that each house has the minimum of 2 people. If not, grab a person from another house without allowing their number of people to drop below 2.

    homeType=rep(c("a", "b"), times=c(207, 408))
    
    H <- vector(mode="list", length(homeType))
    for(i in seq(H)){
      H[[i]]$type <- homeType[i]
      H[[i]]$n <- 0
    }
    H
    
    # Place people in houses up to max number of people
    npeople <- 1524
    for(i in seq(npeople)){
      placed_in_house <- FALSE
      while(!placed_in_house){
        house_num <- sample(length(H), 1)
        if(H[[house_num]]$type == "a"){
          if(H[[house_num]]$n < 7){
            H[[house_num]]$n <- H[[house_num]]$n + 1
            placed_in_house <- TRUE
          }
        }
        if(H[[house_num]]$type == "b"){
          if(H[[house_num]]$n < 6){
            H[[house_num]]$n <- H[[house_num]]$n + 1
            placed_in_house <- TRUE
          }
        }
      }
    }
    H
    hist(unlist(lapply(H, function(x)x$n)))
    
    # move people around to get up to min number of people
    for(i in seq(H)){
      while(H[[i]]$n < 2){
        knock_on_door <- sample(length(H), 1)
        if( H[[knock_on_door]]$n > 2){
          H[[i]]$n <- H[[i]]$n + 1 # house i takes 1 person
          H[[knock_on_door]]$n <- H[[knock_on_door]]$n - 1 # house knock_on_door loses 1 person
        }
      }
    }
    H
    Ha <- H[which(lapply(H, function(x){x$type}) == "a")]
    Hb <- H[which(lapply(H, function(x){x$type}) == "b")]
    
    
    op <- par(mfcol=c(1,2))
    breaks=2:7
    hist(unlist(lapply(Ha, function(x)x$n)), breaks=breaks, col=8, xlab="people per house", main="type A") # 2:7
    hist(unlist(lapply(Hb, function(x)x$n)), breaks=breaks, col=8, xlab="people per house", main="type B") # 2:6
    par(op)
    
    sum(unlist(lapply(Ha, function(x)x$n))) + sum(unlist(lapply(Hb, function(x)x$n)))
    # [1] 1524
    
    
    Houses <- data.frame(
      Num.Groups = seq(H),
      type=unlist(lapply(H, function(x){x$type})),
      Num.People=unlist(lapply(H, function(x){x$n}))
    )
    head(Houses)
    

    As you can see, the total numbers and distribution of people per house works out.

enter image description here

like image 108
Marc in the box Avatar answered Oct 06 '22 06:10

Marc in the box


There are two outer categories, A and B. Outer category A always has 207 inner categories each of size in 2:7, outer category B always has 408 inner categories each of size in 2:6.

This means:

  • The minimum number of people needed in outer category A must be 207*2 == 414.
  • The maximum number of people possible in outer category A is 207*7 == 1449.
  • The minimum number of people needed in outer category B must be 408*2 == 816.
  • The maximum number of people possible in outer category B is 408*6 == 2448.

We can further conclude:

  • The minimum number of people needed in total is 414+816 == 1230.
  • The maximum number of people possible in total is 1449+2448 == 3897.

I've captured these values as constants assigned at the start of my solution, copied here for reference:

Ainner <- 207L;
Binner <- 408L;
Amin <- Ainner*2L;
Bmin <- Binner*2L;
Amax <- Ainner*7L;
Bmax <- Binner*6L;
NPmin <- Amin+Bmin;
NPmax <- Amax+Bmax;

I designed my solution around the idea that we want to minimize the amount of looping that we need to do in order to find a suitable allocation of people to categories. We can get closer to that goal by initially computing a random division of the total number of people into the two outer categories A and B.

Once that is done, we must select random sizes for each inner category such that the outer category totals hit our division numbers exactly. This is not an easy task; I have not been able to think of a way of using the PRNG primitives to instantaneously produce values that satisfy the constraints. I think we need to iterate.

My solution to minimize looping was to derive normal distributions whose distribution parameters (that is, the mean and standard deviation) are parameterized on the selected outer category sizes (Asize and Bsize in the code), contrived in such a way that the distributions are likely to produce values whose aggregation will be very close to the required total in each outer category, while still providing significant randomness in the inner category size selections. We can then run a loop, making one increment or decrement every iteration as necessary, until we arrive at the required total. The loop is still necessary, but the number of iterations is minimized.


The form of the mean equations follows this pattern:

{extended-min} + {extended-range}*({size}-{min})/({max}-{min})
  • {extended-min} is the bottom end of a range that extends just outside the inner category size range. For example, for outer category A, the inner category size range is 2:7, and my extended range is 1.5:7.5. I used an extended range because I wanted to permit some variance in the random variates produced at the extreme ends of the function, even though they should theoretically approach a limit of the end of the inner category size range. It made the derivation and manipulation of the formula easier than it would have been had I tried to make it reach the theoretically ideal limit at each end, and it's aesthetically more pleasing to plot it this way. Note that the possibility (in fact, certainty) that the normal distribution deviates will fall outside the inner category size range is not a problem, because I cap the values at the endpoints with pmin() and pmax().
  • {extended-range} is the full extended range, e.g. 6 for outer category A.
  • {size} is the size that was randomly chosen for the outer category.
  • {min} and {max} are the minimum and maximum permissible values of the outer category size.

These are the actual equations I derived:

Amean <- function(Asize) 1.5 + 6*(Asize-Amin)/(Amax-Amin);
Bmean <- function(Bsize) 1.5 + 5*(Bsize-Bmin)/(Bmax-Bmin);

The form of the standard deviation equations follows this pattern:

{mult1}*exp(-({mult2}*(2*{size}-{min}-{max})/({max}-{min}))^2)
  • {mult1} and {mult2} are simply multipliers that I fiddled with to get an intuitive behavior of the final normal distributions, based on the goal of mapping the distributions to the required inner category allocations as closely as possible.
  • {size} is the same as before.
  • {min} and {max} are the same as before.

The rationale for the standard deviation form is that the standard deviation will be symmetric about the midpoint of the valid size range, producing a wide normal distribution for middleground sizes, and becoming more narrow towards the extreme ends. Note that the quotient in the exponential is equivalent to this:

({size}-({min}+{max})/2)/(({max}-{min})/2)

Hence it's the deviation of the size from the midpoint of the inner category size range, divided by half the range. That provides a domain of [-1,1], which is then multiplied by {mult2} and squared. The resulting negative exponent gets very large for extreme values, causing the entire exponential to become very small. This small standard deviation is what gives the normal distribution its narrowness towards the ends of the inner category size range.

These are the actual equations:

Asd <- function(Asize) 1.3*exp(-(1.22*(2*Asize-Amin-Amax)/(Amax-Amin))^2);
Bsd <- function(Bsize) 1.3*exp(-(1.22*(2*Bsize-Bmin-Bmax)/(Bmax-Bmin))^2);

Here's some code I wrote to nicely visualize the normal distributions:

Outer Category A

xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='Inner Category Size',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Asize.col <- data.frame(Asize=trunc(seq(Amin,Amax,len=7L)),col=c('red','green','blue','brown','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(Asize.col))) {
    Asize <- Asize.col$Asize[ri];
    col <- Asize.col$col[ri];
    lines(x,dnorm(x,Amean(Asize),Asd(Asize)),col=col,lwd=2);
};
with(Asize.col,legend(-2.5,1.65,Asize,col,col,title=expression(bold(Asize))),cex=0.7);
subEnv <- as.environment(mget(c('Amin','Amax')));
text(0.5,1.6,parse(text=paste0('mu == ',deparse(do.call(substitute,c(list(body(Amean)),subEnv))))),pos=4L);
text(0.5,1.53,parse(text=paste0('sigma == ',deparse(do.call(substitute,c(list(body(Asd)),subEnv))))),pos=4L);

normal-2-A


Outer Category B

xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='Inner Category Size',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Bsize.col <- data.frame(Bsize=trunc(seq(Bmin,Bmax,len=7L)),col=c('red','green','blue','brown','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(Bsize.col))) {
    Bsize <- Bsize.col$Bsize[ri];
    col <- Bsize.col$col[ri];
    lines(x,dnorm(x,Bmean(Bsize),Bsd(Bsize)),col=col,lwd=2);
};
with(Bsize.col,legend(-2.5,1.65,Bsize,col,col,title=expression(bold(Bsize))),cex=0.7);
subEnv <- as.environment(mget(c('Bmin','Bmax')));
text(0.3,1.6,parse(text=paste0('mu == ',deparse(do.call(substitute,c(list(body(Bmean)),subEnv))))),pos=4L);
text(0.3,1.53,parse(text=paste0('sigma == ',deparse(do.call(substitute,c(list(body(Bsd)),subEnv))))),pos=4L);

normal-2-B


Solution

## fixed constants
Ainner <- 207L;
Binner <- 408L;
Amin <- Ainner*2L;
Bmin <- Binner*2L;
Amax <- Ainner*7L;
Bmax <- Binner*6L;
NPmin <- Amin+Bmin;
NPmax <- Amax+Bmax;

## normal mean and sd functions
Amean <- function(Asize) 1.5 + 6*(Asize-Amin)/(Amax-Amin);
Asd <- function(Asize) 1.3*exp(-(1.22*(2*Asize-Amin-Amax)/(Amax-Amin))^2);
Bmean <- function(Bsize) 1.5 + 5*(Bsize-Bmin)/(Bmax-Bmin);
Bsd <- function(Bsize) 1.3*exp(-(1.22*(2*Bsize-Bmin-Bmax)/(Bmax-Bmin))^2);

## primary implementation function
bgoldst <- function(NP,seed=NULL,check=F) {

    if (!is.null(seed)) set.seed(seed);

    ## in order to parameterize the total number of ppl, must consider exactly which constraints impose which limits
    ## the A min 414 and max 1449 are fixed based on the 207 and 408 inner categories
    ## the B min 816 and max 2448 are also fixed for the same reason
    ## the mins cannot be changed by the parameterized total number of ppl
    ## moreover, we should validate that the total number of ppl is sufficient for all inner categories
    ## this requires 414+816 == 1230
    if (NP<NPmin) stop(paste0('insufficient NP=',NP,'.'));
    ## additionally we should validate that the total number of ppl does not exceed the maximum possible that can be handled by the inner categories
    ## this is 1449+2448 == 3897
    if (NP>NPmax) stop(paste0('excessive NP=',NP,'.'));
    ## the A max varies from 1449 down to 414, depending on NP
    ## the B max varies from 2448 down to 816, depending on NP
    ## so what we can do as the first step is calculate the maxes based on NP
    AminCur <- max(Amin,NP-Bmax);
    BminCur <- max(Bmin,NP-Amax);
    AmaxCur <- min(Amax,NP-Bmin);
    BmaxCur <- min(Bmax,NP-Amin);
    ## now we can select a random division from the available space
    Asize <- if (AminCur==AmaxCur) AminCur else sample(AminCur:AmaxCur,1L);
    Bsize <- NP-Asize;

    ## will use carefully designed sliding normal distributions to couple the probability distribution to the constraints
    ## see global functions for formulae

    ## randomly choose inner category sizes for A
    ## we know the exact number of inner categories we need, so choose that many inner category sizes using the normal dist
    AG <- pmin(7L,pmax(2L,as.integer(rnorm(Ainner,Amean(Asize),Asd(Asize)))));
    ## iterate adding/removing one member at a time to get to the required size
    AGsum <- sum(AG);
    if (AGsum>Asize) {
        while (AGsum>Asize) {
            i <- which(AG>2L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            AG[i] <- AG[i]-1L;
            AGsum <- AGsum-1L;
        }; ## end while
    } else if (AGsum<Asize) {
        while (AGsum<Asize) {
            i <- which(AG<7L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            AG[i] <- AG[i]+1L;
            AGsum <- AGsum+1L;
        }; ## end while
    }; ## end if

    ## randomly choose inner category sizes for B
    BG <- pmin(6L,pmax(2L,as.integer(rnorm(Binner,Bmean(Bsize),Bsd(Bsize)))));
    ## iterate adding/removing one member at a time to get to the required size
    BGsum <- sum(BG);
    if (BGsum>Bsize) {
        while (BGsum>Bsize) {
            i <- which(BG>2L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            BG[i] <- BG[i]-1L;
            BGsum <- BGsum-1L;
        }; ## end while
    } else if (BGsum<Bsize) {
        while (BGsum<Bsize) {
            i <- which(BG<6L);
            if (length(i)>1L) i <- sample(i,1L); ## don't let sample()'s inconsistency screw us
            BG[i] <- BG[i]+1L;
            BGsum <- BGsum+1L;
        }; ## end while
    }; ## end if

    ## combine into data.frame, randomly distributing the inner categories across inner category ids
    res <- data.frame(
        outer=rep(c('A','B'),c(Ainner,Binner)),
        inner=c(1:Ainner,1:Binner),
        num=c(sample(AG),sample(BG))
    );

    if (check) bgoldst.check(NP,res,seed);

    res;

}; ## end bgoldst()

## validation check helper function
bgoldst.check <- function(NP,res,seed=NULL) {
    seedStr <- if (is.null(seed)) 'NULL' else as.character(seed);
    ## A
    with(res[res$outer=='A',],{
        if (length(outer)!=Ainner) stop(paste0('outer category A has wrong number of inner categories ',length(outer),'!=',Ainner,' [',seedStr,'].'));
        x <- num>=2L & num<=7L;
        if (!all(x)) stop(paste0('outer category A has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=Amin && x<=Amax)) stop(paste0('outer category A has invalid size ',x,' [',seedStr,'].'));
    });
    ## B
    with(res[res$outer=='B',],{
        if (length(outer)!=Binner) stop(paste0('outer category B has wrong number of inner categories ',length(outer),'!=',Binner,' [',seedStr,'].'));
        x <- num>=2L & num<=6L;
        if (!all(x)) stop(paste0('outer category B has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=Bmin && x<=Bmax)) stop(paste0('outer category B has invalid size ',x,' [',seedStr,'].'));
    });
    ## all
    with(res,{
        x <- sum(num);
        if (x!=NP) stop(paste0('result has invalid total size ',x,' [',seedStr,'].'));
    });
}; ## end bgoldst.check()

## one-off demo
res <- bgoldst(1524L,1L,T);
head(res,10L); tail(res,10L);
##    outer inner num
## 1      A     1   2
## 2      A     2   3
## 3      A     3   3
## 4      A     4   2
## 5      A     5   2
## 6      A     6   2
## 7      A     7   2
## 8      A     8   4
## 9      A     9   2
## 10     A    10   2
##     outer inner num
## 606     B   399   3
## 607     B   400   2
## 608     B   401   4
## 609     B   402   2
## 610     B   403   2
## 611     B   404   2
## 612     B   405   6
## 613     B   406   2
## 614     B   407   2
## 615     B   408   5
table(res$outer,res$num);
##
##       2   3   4   5   6
##   A 158  28  13   8   0
##   B 282  68  33  18   7

## extensive testing
for (seed in seq_len(1e5L)) {
    print(seed);
    set.seed(seed);
    bgoldst(sample(NPmin:NPmax,1L),NULL,T);
}; ## end for
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
##
## ... snip ... (all succeed, all fast)
##
## [1] 99996
## [1] 99997
## [1] 99998
## [1] 99999
## [1] 100000

First Attempt

(Original intro: Ok, unfortunately, there was some ambiguity in the OP's wording, specifically where he said "it does not matter if a group category is not used". I assumed that meant that inner categories could have zero members. My solution below was based on that premise. My assumption was incorrect, and that changes everything. I will leave my answer the way I wrote it as I work on a new solution.)

## primary implementation function
bgoldst <- function(seed=NULL,check=F) {

    if (!is.null(seed)) set.seed(seed);

    ## divide 1524 into two outer categories -- sample the acceptable divisions
    ## notably, cannot allow only 1 person into either outer category
    ## also, cannot take more than 1449 ppl into A; most it can hold is 7*207 == 1449
    ## B can hold any number from zero to 1524
    NHA <- sample(c(0L,2:1449),1L);
    NHB <- 1524L-NHA;

    ## also, since 1449 would *require* 7 ppl in every category, must prep normal dist params
    ## specifically, will slide mean from 4.5 over towards (and past) 7, sd smaller the closer we are to 1449
    Amean <- 4.5 + 3*NHA/1449;
    Asd <- 1.5*exp(-(NHA/1e3)^1.6);

    ## divide A into 207 inner categories of 2:7 -- iterative sampling until valid
    ## should be very few iterations, since we over-append with high likelihood
    AG <- integer();
    if (NHA>0L) {
        repeat {
            AG <- c(AG,pmin(7L,pmax(2L,as.integer(rnorm(max(1,NHA/3),Amean,Asd)))));
            ## find last inner category
            AGcum <- cumsum(AG);
            AGLastIndex <- which(AGcum>=NHA)[1L];
            if (!is.na(AGLastIndex)) { ## sufficient coverage
                ## also must guard against too lightly allocated to fit within inner category num limit
                if (AGLastIndex>207L) {
                    AG <- integer(); ## hard reset
                } else {
                    break; ## done
                }; ## end if
            }; ## end if
        }; ## end repeat
        ## remove surplus inner categories and surplus in last inner category
        length(AG) <- AGLastIndex;
        AG[AGLastIndex] <- AG[AGLastIndex] - (AGcum[AGLastIndex]-NHA);
        if (AG[AGLastIndex]==1L) { ## special case for last inner category remnant of 1; must even out against previous inner category
            ## also, can't join max inner category size since it would overflow the last group
            ## also, can't take 1 less than previous inner category size since then *it* would be left with 1
            takeOpt <- setdiff(1:if (AG[AGLastIndex-1L]==7L) 5L else AG[AGLastIndex-1L],AG[AGLastIndex-1L]-1L);
            take <- if (length(takeOpt)==1L) takeOpt else sample(takeOpt,1L); ## don't let sample()'s inconsistent behavior screw us
            AG[AGLastIndex-1L] <- AG[AGLastIndex-1L]-take;
            AG[AGLastIndex] <- AG[AGLastIndex]+take;
        }; ## end if
    }; ## end if

    ## divide Bs into 408 inner categories of 2:6 -- iterative sampling until valid
    BG <- integer();
    if (NHB>0L) {
        repeat {
            BG <- c(BG,sample(2:6,max(1,NHB/3),replace=T));
            ## find last inner category
            BGcum <- cumsum(BG);
            BGLastIndex <- which(BGcum>=NHB)[1L];
            if (!is.na(BGLastIndex)) { ## sufficient coverage
                ## also must guard against too lightly allocated to fit within inner category num limit
                if (BGLastIndex>408L) {
                    BG <- integer(); ## hard reset
                } else {
                    break; ## done
                }; ## end if
            }; ## end if
        }; ## end repeat
        ## remove surplus inner categories and surplus in last inner category
        length(BG) <- BGLastIndex;
        BG[BGLastIndex] <- BG[BGLastIndex] - (BGcum[BGLastIndex]-NHB);
        if (BG[BGLastIndex]==1L) { ## special case for last inner category remnant of 1; must even out against previous inner category
            ## also, can't join max inner category size since it would overflow the last group
            ## also, can't take 1 less than previous inner category size since then *it* would be left with 1
            takeOpt <- setdiff(1:if (BG[BGLastIndex-1L]==6L) 4L else BG[BGLastIndex-1L],BG[BGLastIndex-1L]-1L);
            take <- if (length(takeOpt)==1L) takeOpt else sample(takeOpt,1L); ## don't let sample()'s inconsistent behavior screw us
            BG[BGLastIndex-1L] <- BG[BGLastIndex-1L]-take;
            BG[BGLastIndex] <- BG[BGLastIndex]+take;
        }; ## end if
    }; ## end if

    ## combine into data.frame, randomly distributing the inner categories across inner category ids
    res <- data.frame(
        outer=rep(c('A','B'),c(207L,408L)),
        inner=c(1:207,1:408),
        num=c(sample(c(AG,rep(0L,207L-length(AG)))),sample(c(BG,rep(0L,408L-length(BG)))))
    );

    if (check) bgoldst.check(res,seed);

    res;

}; ## end bgoldst()

## validation check helper function
bgoldst.check <- function(res,seed=NULL) {
    seedStr <- if (is.null(seed)) 'NULL' else as.character(seed);
    ## A
    with(res[res$outer=='A',],{
        if (length(outer)!=207L) stop(paste0('outer category A has wrong number of inner categories ',length(outer),'!=207 [',seedStr,'].'));
        x <- num>=2L & num<=7L | num==0L;
        if (!all(x)) stop(paste0('outer category A has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=0L && x<=1524L)) stop(paste0('outer category A has invalid size ',x,' [',seedStr,'].'));
    });
    ## B
    with(res[res$outer=='B',],{
        if (length(outer)!=408L) stop(paste0('outer category B has wrong number of inner categories ',length(outer),'!=408 [',seedStr,'].'));
        x <- num>=2L & num<=6L | num==0L;
        if (!all(x)) stop(paste0('outer category B has invalid inner category size ',num[which(!x)[1L]],' [',seedStr,'].'));
        x <- sum(num);
        if (!(x>=0L && x<=1524L)) stop(paste0('outer category B has invalid size ',x,' [',seedStr,'].'));
    });
    ## all
    with(res,{
        x <- sum(num);
        if (x!=1524L) stop(paste0('result has invalid total size ',x,' [',seedStr,'].'));
    });
}; ## end bgoldst.check()

## one-off demo
res <- bgoldst(1L,T);
head(res,10L); tail(res,10L);
##    outer inner num
## 1      A     1   5
## 2      A     2   4
## 3      A     3   0
## 4      A     4   0
## 5      A     5   0
## 6      A     6   5
## 7      A     7   0
## 8      A     8   5
## 9      A     9   0
## 10     A    10   4
##     outer inner num
## 606     B   399   3
## 607     B   400   5
## 608     B   401   5
## 609     B   402   0
## 610     B   403   6
## 611     B   404   0
## 612     B   405   5
## 613     B   406   2
## 614     B   407   0
## 615     B   408   0
table(res$outer,res$num);
##
##       0   2   3   4   5   6   7
##   A 125   1   9  25  29  15   3
##   B 116  71  57  54  50  60   0

## extensive testing
for (seed in seq_len(1e5L)) {
    print(seed);
    bgoldst(seed,T);
}; ## end for
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
##
## ... snip ... (all succeed, all fast)
##
## [1] 99996
## [1] 99997
## [1] 99998
## [1] 99999
## [1] 100000

Normal distribution behavior:

xlim <- c(-3,10);
ylim <- c(0,1.7);
xticks <- seq(xlim[1L],xlim[2L]);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='AG',ylab='P',axes=F);
axis(1L,xticks,xticks);
axis(2L);
box();
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
x <- seq(xlim[1L],xlim[2L],0.01);
Amean <- function(NHA) 4.5 + 3*NHA/1449;
Asd <- function(NHA) 1.5*exp(-(NHA/1e3)^1.6);
NHA.col <- data.frame(NHA=c(0,300,600,900,1200,1449),col=c('red','green','blue','gold','cyan','magenta'),stringsAsFactors=F);
for (ri in seq_len(nrow(NHA.col))) {
    NHA <- NHA.col$NHA[ri];
    col <- NHA.col$col[ri];
    lines(x,dnorm(x,Amean(NHA),Asd(NHA)),col=col,lwd=2);
};
with(NHA.col,legend(-2.5,1.65,NHA,col,col,title=expression(bold(NHA))),cex=0.7);
text(-2.5,0.92,parse(text=paste0('mu == ',deparse(body(Amean)))),pos=4L);
text(-2.5,0.87,parse(text=paste0('sigma == ',deparse(body(Asd)))),pos=4L);

normal-1

like image 22
bgoldst Avatar answered Oct 06 '22 06:10

bgoldst