I am writing some code to generate balanced experimental designs for market research, specifically for use in conjoint analysis and maximum difference scaling.
The first step is to generate a Partially Balanced Incomplete Block (PBIB) design. This is straightforward using the R package AlgDesign.
For most types of research such a design would be sufficient. However, in market research one wants to control for order effects in each block. This is where I would appreciate some help.
Create test data
# The following code is not essential in understanding the problem,
# but I provide it in case you are curious about the origin of the data itself.
#library(AlgDesign)
#set.seed(12345)
#choices <- 4
#nAttributes <- 7
#blocksize <- 7
#bsize <- rep(choices, blocksize)
#PBIB <- optBlock(~., withinData=factor(1:nAttributes), blocksizes=bsize)
#df <- data.frame(t(array(PBIB$rows, dim=c(choices, blocksize))))
#colnames(df) <- paste("Item", 1:choices, sep="")
#rownames(df) <- paste("Set", 1:nAttributes, sep="")
df <- structure(list(
  Item1 = c(1, 2, 1, 3, 1, 1, 2),
  Item2 = c(4, 4, 2, 5, 3, 2, 3),
  Item3 = c(5, 6, 5, 6, 4, 3, 4),
  Item4 = c(7, 7, 6, 7, 6, 7, 5)),
  .Names = c("Item1", "Item2", "Item3", "Item4"),
  row.names = c("Set1", "Set2", "Set3", "Set4", "Set5", "Set6", "Set7"),
  class = "data.frame")
Define two helper functions
balanceMatrix calculates the balance of the design, counting how often each attribute appears in each column:
balanceMatrix <- function(x){
  t(sapply(unique(unlist(x)), function(i) colSums(x == i)))
}
balanceScore calculates a metric of 'fit' from the balance matrix - lower scores are better, with zero being perfect:
balanceScore <- function(x){
  sum((1 - x)^2)
}
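As a quick illustration of the scoring (a minimal check, using only the function just defined): a perfectly balanced 7 x 4 design would have every attribute appearing exactly once in every column, so its balance matrix contains only ones and it scores zero:
balanceScore(matrix(1, nrow=7, ncol=4))
[1] 0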
Define a function that resamples the rows at random
findBalance <- function(x, nrepeat=100){
  df <- x
  minw <- Inf
  for (n in 1:nrepeat){
    # shuffle the items within each row, then score the shuffled design
    for (i in 1:nrow(x)){df[i, ] <- sample(df[i, ])}
    w <- balanceMatrix(df)
    sumw <- balanceScore(w)
    # keep the best (lowest-scoring) design found so far
    if(sumw < minw){
      dfbest <- df
      minw <- sumw
    }
  }
  dfbest
}
Main code
The dataframe df is a balanced design of 7 sets. Each set will display 4 items to the respondent. The numeric values in df refer to 7 different attributes. For example, in Set1 the respondent will be asked to choose his/her preferred option from attributes 1, 4, 5 and 7.
The ordering of items in each set is conceptually not important; thus an ordering of (1, 4, 5, 7) is identical to (7, 5, 4, 1).
However, to get a fully balanced design, each attribute needs to appear an equal number of times in each column. This design is therefore imbalanced, since attribute 1 appears 4 times in column 1:
df
Item1 Item2 Item3 Item4
Set1 1 4 5 7
Set2 2 4 6 7
Set3 1 2 5 6
Set4 3 5 6 7
Set5 1 3 4 6
Set6 1 2 3 7
Set7 2 3 4 5
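For reference, the helper functions above can be applied directly to this design to quantify the imbalance (an illustrative check; the numbers in the comments are my own counts from the table above):
balanceMatrix(df)                 # attribute 1 appears 4 times under Item1, attribute 7 four times under Item4
balanceScore(balanceMatrix(df))   # 40, far from the perfect score of 0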
To try and find a more balanced design, I wrote the function findBalance. This conducts a random search for better solutions by randomly shuffling the items within each row of df. With 100 repeats it finds the following best solution:
set.seed(12345)
dfbest <- findBalance(df, nrepeat=100)
dfbest
Item1 Item2 Item3 Item4
Set1 7 5 1 4
Set2 6 7 4 2
Set3 2 1 5 6
Set4 5 6 7 3
Set5 3 1 6 4
Set6 7 2 3 1
Set7 4 3 2 5
This appears more balanced, and the calculated balance matrix contains lots of ones. The balance matrix counts the number of times each attribute appears in each column. For example, the following table indicates (in its top row) that attribute 1 does not appear at all in column 1, and appears twice in column 2:
balanceMatrix(dfbest)
Item1 Item2 Item3 Item4
[1,] 0 2 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
[4,] 1 0 1 2
[5,] 1 1 1 1
[6,] 1 1 1 1
[7,] 2 1 1 0
The balance score for this solution is 6, corresponding to the six cells that are unequal to 1 (three 0s and three 2s, each contributing 1 to the sum of squares):
balanceScore(balanceMatrix(dfbest))
[1] 6
My question
Thank you for following this detailed example. My question is: how can I rewrite this search function to be more systematic? Instead of relying on a purely random search, I'd like to tell R to minimise balanceScore(balanceMatrix(df)) by shuffling the values within each row of df, subject to the set of values in each row staying the same.
OK, I somehow misunderstood your question. So bye bye Fedorov, hello applied Fedorov.
The following algorithm is based on the second iteration of the Fedorov algorithm:
1. Start from a random ordering of the items within each set.
2. Go through the sets one by one; for the current set, try every possible ordering of its items while keeping the other sets fixed, and retain the ordering that gives the lowest balance score.
3. Repeat over all sets until the balance score reaches 0 or the maximum number of iterations is reached.
Optionally, you can restart the procedure after 10 iterations and start from another starting point. In your test case, it turned out that a few starting points converged very slowly to 0. The function below found balanced experimental designs with a score of 0 in, on average, 1.5 seconds on my computer:
> X <- findOptimalDesign(df)
> balanceScore(balanceMatrix(X))
[1] 0
> mean(replicate(20, system.time(X <- findOptimalDesign(df))[3]))
[1] 1.733
So this is the function now (given your original balanceMatrix and balanceScore functions):
findOptimalDesign <- function(x, iter=4, restart=TRUE){
  stopifnot(require(combinat))
  # transform rows to a list of sets
  sets <- unlist(apply(x, 1, list), recursive=FALSE)
  nsets <- NROW(x)
  # C0 contains all possible design points (every ordering of every set)
  C0 <- lapply(sets, permn)
  n <- gamma(NCOL(x) + 1)  # number of orderings per set, i.e. ncol(x)!
  # starting point: a random ordering for each set
  id <- sample(1:n, nsets)
  Sol <- sapply(1:nsets, function(i) C0[[i]][id[i]])
  IT <- iter
  # other iterations
  while(IT > 0){
    for(i in 1:nsets){
      nn <- 1:n
      # score every ordering of set i, keeping the other sets fixed
      scores <- sapply(nn, function(p){
        tmp <- Sol
        tmp[[i]] <- C0[[i]][[p]]
        w <- balanceMatrix(do.call(rbind, tmp))
        balanceScore(w)
      })
      idnew <- nn[which.min(scores)]
      Sol[[i]] <- C0[[i]][[idnew]]
    }
    # check if score is 0
    out <- as.data.frame(do.call(rbind, Sol))
    score <- balanceScore(balanceMatrix(out))
    if (score == 0) {break}
    IT <- IT - 1
    # if asked, restart from a new random starting point
    if(IT == 0 && restart){
      id <- sample(1:n, nsets)
      Sol <- sapply(1:nsets, function(i) C0[[i]][id[i]])
      IT <- iter
    }
  }
  out
}
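If you want a quick sanity check on the result (a minimal sketch reusing balanceMatrix and the df from the question, not part of the algorithm itself):
X <- findOptimalDesign(df)
# with a score of 0, every attribute appears exactly once in every column
all(balanceMatrix(X) == 1)
# and every set still contains the same attributes as before, just reordered
all(t(apply(X, 1, sort)) == t(apply(df, 1, sort)))
Both checks should return TRUE whenever a score of 0 is reached.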
HTH
EDIT: fixed a small bug (it restarted immediately after every round, as I forgot to condition on IT). With that fixed, it also runs a bit faster.