FYI: I have edited this significantly since my first edition. This simulation has been reduced from taking 14 hours to 14 minutes.
I am new to programming but I have made a simulation that tries to follow asexual replication in an organism and quantify the differences in chromosome number between the parent and daughter organisms. The simulation runs extremely slow. It takes about 6 hours to complete. I wanted to know what would be the best way to make the simulation run faster.
These digital organisms have x number of chromosomes. Unlike most organisms the chromosomes are all independent of each other so they have equal chance of being transfered into either daughter organism.
In this case, the distribution of chromosomes into a daughter cell follows a binomial distribution with probability of 0.5.
The function sim_repo
takes a matrix of digital organisms with a known number of chromosomes and puts them through 12 generations of replication. It duplicates these chromosomes and then uses the rbinom
function to randomly generate a number. This number is then assigned to a daughter cell. As no chromosomes are lost during asexual reproduction the other daughter cell receives the remaining chromosomes. This is then repeated for G number of generations. Then a single value is sampled from each row in the matrix.
sim_repo = function( x1, G=12, k=1, t=25, h=1000 ) {
# x1 is the list of copy numbers for a somatic chromosome
# G is the number of generations, default is 12
# k is the transfer size, default is 1
# t is the number of transfers, default is 25
# h is the number of times to replicate, default is 1000
dup <- x1 * 2 # duplicate the initial somatic chromosome copy number for replication
pop <- 1 # set generation time
set.seed(11)
z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic chromosome is distributed to one of the daughter cells
z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
x1 <- cbind(z, z1) # put both in a matrix
for ( pop in 1:G ) { # this loop does the replication for each cell in each generation
pop <- 1 + pop # number of generations. This is a count for the for loop
dup <- x1 * 2 # double the somatic chromosomes for replication
set.seed(11)
z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic c hromosomes distributed to one of the daughter cells
z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
x1 <- cbind(z, z1) # put both in a matrix
}
# the following for loop randomly selects one cell in the population that was created
# the output is a matrix of 1 column
x1 <- matrix(apply(x1, 1, sample, size=k), ncol=1)
x1
}
I my research I am interested in the change in variance in the chromosomes of the initial ancestral organisms and the final time point in this simulation. The following function represents transferring a cell into a new living environment. It takes the output from the function sim_re
p and uses that to generate more generations. It then finds the variance among the rows in the first and last matrix columns and finds the difference between them.
# The following function is mostly the same as I talked about in the description.
# The only difference is I changed some aspects to take into account I am using
# matrices and not lists.
# The function outputs the difference between the intial variance component between
# 'cell lines' with the final variance after t number of transfers
sim_exp = function( x1, G=12, k=1, t=25, h=1000 ) {
xn <- matrix(NA, nrow(x1), t)
x <- x1
xn[,1] <- x1
for ( l in 2:t ) {
x <- sim_repo( x, G, k, t, h )
xn[, l] <- x
}
colvar <- matrix(apply(xn,2,var),ncol=ncol(xn))
ivar <- colvar[,1]
fvar <- colvar[,ncol(xn)]
deltavar <- fvar - ivar
deltavar
}
I need to replicate this simulation h number of times. Thus I made the following function that will call the function sim_exp
h number of times.
sim_1000 = function( x1, G=12, k=1, t=25, h=1000 ) {
xn <- vector(length=h)
for ( l in 2:h ) {
x <- sim_exp( x1, G, k, t, h )
xn[l] <- x
}
xn
}
When I call the sim_exp function with a value like 6 values it takes about 52 seconds to complete.
x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1)
system.time(sim_1000(x1,h=1))
user system elapsed
1.280 0.105 1.369
If I can get it faster then I could complete more of these simulations and apply a selection model on the simulation.
My input will look like the following x1
, a matrix with each ancestral organism on its own row:
x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1) # a matrix of 6 organisms
When I run:
a <- sim_repo(x1, G=12, k=1)
My expected output will be:
a
[,1]
[1,] 137
[2,] 82
[3,] 89
[4,] 135
[5,] 89
[6,] 109
system.time(sim_repo(x1))
user system elapsed
1.969 0.059 2.010
When I call the sim_exp function,
b <- sim_exp(x1, G=12, k=1, t=25)
it calls the sim_repo function G times and outputs:
b
[1] 18805.47
When I call the sim_1000
function, I will normally set h to 1000, but here I will set it at 2. So here sim_1000 will call sim_exp
and replicate it 2 times.
c <- sim_1000(x1, G=12, k=1, t=25, h=2)
c
[1] 18805.47 18805.47
As mentioned by others in the comments, if we look only at the function sim_repo
, and replace the line:
dup <- apply(x1, c(1,2),"*",2)
with
dup <- x1 * 2
the lines
z <- apply(dup,c(1,2),rbinom,n=1,prob=0.5)
with
z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup))
and the inner for loop with
x1 <- matrix(apply(x1,1,sample,size = 1), ncol=1)
I get a, well, large speed increase:
system.time(sim_exp(x1))
user system elapsed
0.655 0.017 0.686
> system.time(sim_expOld(x1))
user system elapsed
21.445 0.128 21.530
And I verified that it's doing the same thing:
set.seed(123)
out1 <- sim_exp(x1)
set.seed(123)
out2 <- sim_expOld(x1)
all.equal(out1,out2)
> TRUE
And that's not even delving into the pre-allocation, which might actually be hard without completely redesigning things, given the way you've structured your code.
And that's also not even beginning to look at whether you really need all three functions...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With