Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Make nested loops more efficient?

Tags:

r

r-faq

I'm analyzing large sets of data using the following script:

M <- c_alignment 
c_check <- function(x){
    if (x == c_1) {
        1
    }else{
        0
    }
}
both_c_check <- function(x){
    if (x[res_1] == c_1 && x[res_2] == c_1) {
        1
    }else{
        0
    }
}
variance_function <- function(x,y){
    sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
    for (c_1 in 1:conf_vector[res_1]){
        res_1_insertion <- res_1_insertion + 1
        insertion <- insertion + 1
        res_1_subset <- sapply(M[,res_1], c_check)
        freq_vector[insertion] <- sum(res_1_subset)/frames_total
        res_2_insertion <- 0
        for (res_2 in 1:cols){
            if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
                for (c_2 in 1:max(c_vector[res_2])){
                    res_2_insertion <- res_2_insertion + 1
                    both_res_subset <- apply(M, 1, both_c_check)
                    co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
                    co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
                }
            }
        }
    }
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix

A model input would be something like this:

1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2

What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.

Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.

The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!

like image 768
Michael LeVine Avatar asked Feb 16 '12 22:02

Michael LeVine


People also ask

Are nested for loops efficient?

Since nested loops perform at the rate of the amount of data input there is squared (O(N²) in Big O notation), it is not the most efficient.

How can I improve my loop performance?

The best ways to improve loop performance are to decrease the amount of work done per iteration and decrease the number of loop iterations. Generally speaking, switch is always faster than if-else , but isn't always the best solution.

Can nested loops cause performance issues?

Nested Loops can greatly reduce the innovative potential of the code because it negatively impacts performance.

How do you make a Python loop more efficient?

A faster way to loop in Python is using built-in functions. In our example, we could replace the for loop with the sum function. This function will sum the values inside the range of numbers.


1 Answers

I thought of writing a comment, but I have too much to say.

First of all, if you think apply goes faster, look at Is R's apply family more than syntactic sugar? . It might be, but it's far from guaranteed.

Next, please don't grow matrices as you move through your code, that slows down your code incredibly. preallocate the matrix and fill it up, that can increase your code speed more than a tenfold. You're growing different vectors and matrices through your code, that's insane (forgive me the strong speech)

Then, look at the help page of ?subset and the warning given there:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

Always. Use. Indices.

Further, You recalculate the same values over and over again. fre_res_2 for example is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Get out of your loops what you don't need to recalculate, and save it in matrices you can just access again.

Heck, now I'm at it: Please use vectorized functions. Think again and see what you can drag out of the loops : This is what I see as the core of your calculation:

cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2))) 

As I see it, you can construct a matrix freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov, cov is a function). Exit loops. Enter fast code.

Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.

Let this be a start: The R Inferno

like image 61
Joris Meys Avatar answered Nov 18 '22 16:11

Joris Meys