Make nested loops more efficient?

Tags: r, r-faq

I'm analyzing large sets of data using the following script:

M <- c_alignment 
c_check <- function(x){
    if (x == c_1) {
        1
    }else{
        0
    }
}
both_c_check <- function(x){
    if (x[res_1] == c_1 && x[res_2] == c_2) {
        1
    }else{
        0
    }
}
variance_function <- function(x,y){
    sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
    for (c_1 in 1:c_vector[res_1]){
        res_1_insertion <- res_1_insertion + 1
        insertion <- insertion + 1
        res_1_subset <- sapply(M[,res_1], c_check)
        freq_vector[insertion] <- sum(res_1_subset)/frames_total
        res_2_insertion <- 0
        for (res_2 in 1:cols){
            if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
                for (c_2 in 1:max(c_vector[res_2])){
                    res_2_insertion <- res_2_insertion + 1
                    both_res_subset <- apply(M, 1, both_c_check)
                    co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
                    co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
                }
            }
        }
    }
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix

A model input would be something like this:

1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2

What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.

Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.

The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!

asked Feb 16 '12 22:02

Michael LeVine

1 Answer

I thought of writing a comment, but I have too much to say.

First of all, if you think apply goes faster, look at "Is R's apply family more than syntactic sugar?". It might be, but it's far from guaranteed.
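To make that concrete, here is a minimal sketch on a made-up vector (`states` and `target` are hypothetical names, not taken from your script). The explicit loop and `sapply` both call an R function once per element; the vectorized comparison does the same work in a single call.

```r
# Hypothetical toy column of states (not the asker's data).
states <- c(1, 2, 1, 3, 1, 2)
target <- 1

# Explicit loop over a preallocated result.
hits_loop <- numeric(length(states))
for (i in seq_along(states)) {
  hits_loop[i] <- if (states[i] == target) 1 else 0
}

# sapply() still evaluates the R function once per element.
hits_sapply <- sapply(states, function(x) if (x == target) 1 else 0)

# Vectorized comparison: one call, the looping happens in compiled code.
hits_vec <- as.numeric(states == target)

# Frequency of the target state in a single expression.
freq <- mean(states == target)
```

All three give the same 0/1 vector, so a replacement for `c_check` could plausibly be as short as `mean(M[, res_1] == c_1)`.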

Next, please don't grow matrices as you move through your code; that slows your code down enormously. Preallocate the matrix and fill it up; that can speed up your code more than tenfold. You're growing different vectors and matrices throughout your code, which is insane (forgive me the strong language).
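A small sketch of the difference, using a made-up length `n` and hypothetical names `grown` and `filled`: both loops produce the same vector, but the first one reallocates and copies on every iteration.

```r
n <- 1000

# Growing: every c() call allocates a new vector and copies the old one.
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: allocate once, then fill by index.
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}
```

The same idea applies to matrices: create them once with `matrix(NA_real_, nrow, ncol)` and assign into `[i, j]`.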

Then, look at the help page of ?subset and the warning given there:

"This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences."

Always. Use. Indices.
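As an illustration only, on a made-up matrix (I'm assuming rows are trials and columns are residues, as in your description), this is what selecting rows with a logical index and `[` looks like:

```r
# Hypothetical toy matrix of states: rows are trials, columns are residues.
M <- matrix(c(1, 2, 1,
              1, 3, 4,
              2, 3, 3), nrow = 3, byrow = TRUE)

# Logical index: trials in which column 1 is in state 1.
rows_state_1 <- M[, 1] == 1

# Standard subsetting with `[`; drop = FALSE keeps the result a matrix.
picked <- M[rows_state_1, , drop = FALSE]

# Fraction of trials in which column 1 is in state 1.
freq_state_1 <- mean(rows_state_1)
```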

Further, you recalculate the same values over and over again. freq_res_2, for example, is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Move out of your loops whatever you don't need to recalculate, and save it in matrices you can simply access again.
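One possible way to do that, sketched on a made-up matrix and assuming the states in each column are the integers 1 to max (the name `state_freqs` is hypothetical): compute every single-state frequency once, then look the values up wherever they are needed.

```r
# Hypothetical toy matrix of states: rows are trials, columns are residues.
M <- matrix(c(1, 2, 1,
              1, 3, 4,
              2, 3, 3,
              1, 1, 2), nrow = 4, byrow = TRUE)

# One pass per column: frequency of every state 1..max, stored for later lookup.
state_freqs <- lapply(seq_len(ncol(M)), function(j) {
  tabulate(M[, j], nbins = max(M[, j])) / nrow(M)
})

# Inside any later loop this is a lookup, not a recomputation.
freq_col2_state3 <- state_freqs[[2]][3]
```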

Heck, while I'm at it: please use vectorized functions. Think again and see what you can drag out of the loops. This is what I see as the core of your calculation:

cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))

As I see it, you can construct the matrices freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov; cov is a function). Exit loops. Enter fast code.
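A rough sketch of how that could look, not a drop-in replacement for your script. It assumes the states in each column are the integers 1 to max, uses a made-up input matrix, and introduces the hypothetical names `indicator` and `sd_vector`; the other names follow the question.

```r
# Hypothetical toy input: rows are trials, columns are residues, values are states.
M <- matrix(c(1, 2, 1,
              1, 3, 2,
              2, 3, 2,
              1, 1, 1,
              2, 3, 1), ncol = 3, byrow = TRUE)
frames_total <- nrow(M)
c_vector <- apply(M, 2, max)   # number of states per column, as in the question

# One 0/1 indicator column per (column, state) pair.
indicator <- do.call(cbind, lapply(seq_len(ncol(M)), function(j) {
  outer(M[, j], seq_len(c_vector[j]), "==") * 1
}))

# Frequency of each single state, and of each pair of states in the same trial.
freq_vector    <- colMeans(indicator)
co_freq_matrix <- crossprod(indicator) / frames_total

# Covariance and correlation for all pairs at once.
covariance_matrix <- co_freq_matrix - outer(freq_vector, freq_vector)
sd_vector         <- sqrt(freq_vector * (1 - freq_vector))
correlation_coefficient_matrix <- covariance_matrix / outer(sd_vector, sd_vector)
```

For a 4200 x 510 input with about 15 states per column, `indicator` would have roughly 7650 columns, so the result is a dense 7650 x 7650 matrix; check that it fits in memory. A state that never occurs in its column has frequency 0, which makes its rows and columns of the correlation matrix NaN.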

Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.

Let this be a start: The R Inferno

answered Nov 18 '22 16:11

Joris Meys
