Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Randomizing balanced experimental designs

I am writing some code to generate balanced experimental designs for market research, specifically for use in conjoint analysis and maximum difference scaling.

The first step is to generate a Partially Balanced Incompleted Block (PBIB) design. This is straight-forward using the R package AlgDesign.

For most types of research such a design would be sufficient. However, in market research one wants to control for order effects in each block. This is where I would appreciate some help.

Create test data

# The following code is not essential in understanding the problem, 
# but I provide it in case you are curious about the origin of the data itself.
#library(AlgDesign)
#set.seed(12345)
#choices <- 4
#nAttributes <- 7
#blocksize <- 7
#bsize <- rep(choices, blocksize)
#PBIB <- optBlock(~., withinData=factor(1:nAttributes), blocksizes=bsize)
#df <- data.frame(t(array(PBIB$rows, dim=c(choices, blocksize))))
#colnames(df) <- paste("Item", 1:choices, sep="")
#rownames(df) <- paste("Set", 1:nAttributes, sep="")

df <- structure(list(
  Item1 = c(1, 2, 1, 3, 1, 1, 2), 
  Item2 = c(4, 4, 2, 5, 3, 2, 3), 
  Item3 = c(5, 6, 5, 6, 4, 3, 4), 
  Item4 = c(7, 7, 6, 7, 6, 7, 5)), 
  .Names = c("Item1", "Item2", "Item3", "Item4"), 
  row.names = c("Set1", "Set2", "Set3", "Set4", "Set5", "Set6", "Set7"), 
  class = "data.frame")

** Define two helper functions

balanceMatrix calculates the balance of the matrix:

balanceMatrix <- function(x){
    t(sapply(unique(unlist(x)), function(i)colSums(x==i)))
}

balanceScore calculates a metric of 'fit' - lower scores are better, with zero perfect:

balanceScore <- function(x){
    sum((1-x)^2)
}

Define a function that resamples the rows at random

findBalance <- function(x, nrepeat=100){
    df <- x
    minw <- Inf
    for (n in 1:nrepeat){
        for (i in 1:nrow(x)){df[i,] <- sample(df[i, ])}
        w <- balanceMatrix(df)
        sumw <- balanceScore(w)
        if(sumw < minw){
            dfbest <- df
            minw <- sumw
        }
    }
    dfbest
}

Main code

The dataframe df is a balanced design of 7 sets. Each set will display 4 items to the respondent. The numeric values in df refers to 7 different attributes. For example, in Set1 respondent will be asked to choose his/her preferred option from attributes 1, 3, 4 and 7.

The ordering of items in each set is conceptually not important. Thus an ordering of (1,4,5,7) is identical to (7,5,4,1).

However, to get a fully balanced design, each attribute will appear an equal number of times in each column. This design is there imbalanced, since attribute 1 appears 4 times in column 1:

df

     Item1 Item2 Item3 Item4
Set1     1     4     5     7
Set2     2     4     6     7
Set3     1     2     5     6
Set4     3     5     6     7
Set5     1     3     4     6
Set6     1     2     3     7
Set7     2     3     4     5

To try and find a more balanced design, I wrote the function findBalance. This conducts a random search for better solutions, by randomly sampling across the rows of df. With 100 repeats it finds the following best solution:

set.seed(12345)
dfbest <- findBalance(df, nrepeat=100)
dfbest

     Item1 Item2 Item3 Item4
Set1     7     5     1     4
Set2     6     7     4     2
Set3     2     1     5     6
Set4     5     6     7     3
Set5     3     1     6     4
Set6     7     2     3     1
Set7     4     3     2     5

This appears more balanced, and the calculated balance matrix contains lots of ones. The balance matrix counts the number of times each attribute appears in each column. For example, the following table indicates (in the top left hand cell) that attribute 1 appears twice not at all in column 1, and twice in column 2:

balanceMatrix(dfbest)

     Item1 Item2 Item3 Item4
[1,]     0     2     1     1
[2,]     1     1     1     1
[3,]     1     1     1     1
[4,]     1     0     1     2
[5,]     1     1     1     1
[6,]     1     1     1     1
[7,]     2     1     1     0

The balance score for this solution is 6, indicating there are at least six cells unequal to 1:

balanceScore(balanceMatrix(dfbest))
[1] 6

My question

Thank you for following this detailed example. My question is how can I rewrite this search function to be more systematic? I'd like to tell R to:

  • Minimize balanceScore(df)
  • By changing row order of df
  • Subject to: already fully constrained
like image 848
Andrie Avatar asked Apr 12 '11 13:04

Andrie


People also ask

How do you randomize an experimental design?

In a completely randomized design, objects or subjects are assigned to groups completely at random. One standard method for assigning subjects to treatment groups is to label each subject, then use a table of random numbers to select from the labelled subjects. This may also be accomplished using a computer.

What are the 4 types of experimental design?

Four major design types with relevance to user research are experimental, quasi-experimental, correlational and single subject. These research designs proceed from a level of high validity and generalizability to ones with lower validity and generalizability. First, a note on validity.

Why is it important to randomize in experiments?

Randomization as a method of experimental control has been extensively used in human clinical trials and other biological experiments. It prevents the selection bias and insures against the accidental bias. It produces the comparable groups and eliminates the source of bias in treatment assignments.

What are the 3 types of experimental design?

There are three primary types of experimental design: Pre-experimental research design. True experimental research design. Quasi-experimental research design.


1 Answers

OK, I somehow misunderstood your question. So bye bye Fedorov, hello applied Fedorov.

The following algorithm is based on the second iteration of the Fedorov algorithm :

  1. calculate all possible permutation for every set, and store them in the C0 list
  2. draw a first possible solution from the C0 space (one permutation for every set). This can be the original one, but as I need the indices, I rather just start at random.
  3. calculate the score for every new solution, where the first set is replaced by all permutations.
  4. replace the first set with the permutation giving the lowest score
  5. repeat 3-4 for every other set
  6. repeat 3-5 until score reaches 0 or for n iterations.

Optionally, you can restart the procedure after 10 iterations and start from another starting point. In you test case, it turned out that few starting points converged very slowly to 0. The function below found balanced experimental designs with a score of 0 in on average 1.5 seconds on my computer :

> X <- findOptimalDesign(df)
> balanceScore(balanceMatrix(X))
[1] 0
> mean(replicate(20, system.time(X <- findOptimalDesign(df))[3]))
[1] 1.733

So this is the function now (given your original balanceMatrix and balanceScore functions) :

findOptimalDesign <- function(x,iter=4,restart=T){
    stopifnot(require(combinat))
    # transform rows to list
    sets <- unlist(apply(x,1,list),recursive=F)
    nsets <- NROW(x)
    # C0 contains all possible design points
    C0 <- lapply(sets,permn)
    n <- gamma(NCOL(x)+1)

    # starting point
    id <- sample(1:n,nsets)
    Sol <- sapply(1:nsets,function(i)C0[[i]][id[i]])

    IT <- iter
    # other iterations
    while(IT > 0){
      for(i in 1:nsets){
          nn <- 1:n
          scores <- sapply(nn,function(p){
             tmp <- Sol
             tmp[[i]] <- C0[[i]][[p]]
             w <- balanceMatrix(do.call(rbind,tmp))
             balanceScore(w)
          })
          idnew <- nn[which.min(scores)]
          Sol[[i]] <- C0[[i]][[idnew]]

      }
      #Check if score is 0
      out <- as.data.frame(do.call(rbind,Sol))
      score <- balanceScore(balanceMatrix(out))
      if (score==0) {break}
      IT <- IT - 1

      # If asked, restart
      if(IT==0 & restart){
          id <- sample(1:n,nsets)
          Sol <- sapply(1:nsets,function(i)C0[[i]][id[i]])
          IT <- iter
      }
    }
    out
}

HTH

EDIT : fixed small bug (it restarted immediately after every round as I forgot to condition on IT). Doing that, it runs a bit faster still.

like image 174
Joris Meys Avatar answered Sep 27 '22 16:09

Joris Meys