Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster code in R

FYI: I have edited this significantly since my first edition. This simulation has been reduced from taking 14 hours to 14 minutes.

I am new to programming but I have made a simulation that tries to follow asexual replication in an organism and quantify the differences in chromosome number between the parent and daughter organisms. The simulation runs extremely slow. It takes about 6 hours to complete. I wanted to know what would be the best way to make the simulation run faster.

These digital organisms have x number of chromosomes. Unlike most organisms the chromosomes are all independent of each other so they have equal chance of being transfered into either daughter organism.

In this case, the distribution of chromosomes into a daughter cell follows a binomial distribution with probability of 0.5.

The function sim_repo takes a matrix of digital organisms with a known number of chromosomes and puts them through 12 generations of replication. It duplicates these chromosomes and then uses the rbinom function to randomly generate a number. This number is then assigned to a daughter cell. As no chromosomes are lost during asexual reproduction the other daughter cell receives the remaining chromosomes. This is then repeated for G number of generations. Then a single value is sampled from each row in the matrix.

 sim_repo = function( x1, G=12, k=1, t=25, h=1000 ) {

            # x1 is the list of copy numbers for a somatic chromosome
            # G is the number of generations, default is 12
            # k is the transfer size, default is 1
            # t is the number of transfers, default is 25
            # h is the number of times to replicate, default is 1000

            dup <- x1 * 2 # duplicate the initial somatic chromosome copy number for replication
            pop <- 1 # set generation time
            set.seed(11)
            z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic chromosome is distributed to one of the daughter cells
            z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
            x1 <- cbind(z, z1) # put both in a matrix

            for ( pop in 1:G ) { # this loop does the replication for each cell in each generation
                pop <- 1 + pop # number of generations.  This is a count for the for loop
                dup <- x1 * 2 # double the somatic chromosomes for replication
                set.seed(11)
                z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic c hromosomes distributed to one of the daughter cells
                z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
                x1 <- cbind(z, z1) # put both in a matrix
                }

            # the following for loop randomly selects one cell in the population that was created
            # the output is a matrix of 1 column
            x1 <- matrix(apply(x1, 1, sample, size=k), ncol=1)
            x1
    }

I my research I am interested in the change in variance in the chromosomes of the initial ancestral organisms and the final time point in this simulation. The following function represents transferring a cell into a new living environment. It takes the output from the function sim_rep and uses that to generate more generations. It then finds the variance among the rows in the first and last matrix columns and finds the difference between them.

    # The following function is mostly the same as I talked about in the description.
    # The only difference is I changed some aspects to take into account I am using
    # matrices and not lists.
    # The function outputs the difference between the intial variance component between
    # 'cell lines' with the final variance after t number of transfers

sim_exp = function( x1, G=12, k=1, t=25, h=1000 ) {

    xn <- matrix(NA, nrow(x1), t)  
    x <- x1
    xn[,1] <- x1
    for ( l in 2:t ) {
        x <- sim_repo( x, G, k, t, h )
        xn[, l] <- x
    }

    colvar <- matrix(apply(xn,2,var),ncol=ncol(xn))
    ivar <- colvar[,1]
    fvar <- colvar[,ncol(xn)]
    deltavar <- fvar - ivar
    deltavar
}  

I need to replicate this simulation h number of times. Thus I made the following function that will call the function sim_exp h number of times.

sim_1000 = function( x1, G=12, k=1, t=25, h=1000 ) {
    xn <- vector(length=h)
    for ( l in 2:h ) {
        x <- sim_exp( x1, G, k, t, h )
        xn[l] <- x
    }
        xn
}

When I call the sim_exp function with a value like 6 values it takes about 52 seconds to complete.

 x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1)
 system.time(sim_1000(x1,h=1))
   user  system elapsed 
  1.280   0.105   1.369 

If I can get it faster then I could complete more of these simulations and apply a selection model on the simulation.

My input will look like the following x1, a matrix with each ancestral organism on its own row:

x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1) # a matrix of 6 organisms

When I run:

a <- sim_repo(x1, G=12, k=1)

My expected output will be:

 a
     [,1]
[1,]  137
[2,]   82
[3,]   89
[4,]  135
[5,]   89
[6,]  109

 system.time(sim_repo(x1))
   user  system elapsed 
  1.969   0.059   2.010 

When I call the sim_exp function,

b <- sim_exp(x1, G=12, k=1, t=25)

it calls the sim_repo function G times and outputs:

 b
[1] 18805.47

When I call the sim_1000 function, I will normally set h to 1000, but here I will set it at 2. So here sim_1000 will call sim_exp and replicate it 2 times.

c <- sim_1000(x1, G=12, k=1, t=25, h=2)
c
[1] 18805.47 18805.47
like image 913
Kevin Avatar asked Mar 07 '12 01:03

Kevin


1 Answers

As mentioned by others in the comments, if we look only at the function sim_repo, and replace the line:

dup <- apply(x1, c(1,2),"*",2)

with

dup <- x1 * 2

the lines

z <- apply(dup,c(1,2),rbinom,n=1,prob=0.5)

with

z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup))

and the inner for loop with

x1 <- matrix(apply(x1,1,sample,size = 1), ncol=1)

I get a, well, large speed increase:

system.time(sim_exp(x1))
   user  system elapsed 
  0.655   0.017   0.686 
> system.time(sim_expOld(x1))
   user  system elapsed 
 21.445   0.128  21.530 

And I verified that it's doing the same thing:

set.seed(123)
out1 <- sim_exp(x1)

set.seed(123)
out2 <- sim_expOld(x1)

all.equal(out1,out2)
> TRUE

And that's not even delving into the pre-allocation, which might actually be hard without completely redesigning things, given the way you've structured your code.

And that's also not even beginning to look at whether you really need all three functions...

like image 132
joran Avatar answered Sep 22 '22 16:09

joran