Does calculating correlation between two dataframes require a loop?

Tags: dataframe, r, correlation

I have a set of large dataframes that look like A and B:

A <- data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))

  A1 B1 C1
1  1  6 11
2  2  7 12
3  3  8 13
4  4  9 14
5  5 10 15

B <- data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))

  A2 B2 C2
1  6  2  1
2  7  1  5
3  7  3 16
4 10  8  7
5 11 11  8

I would like to create a vector (C) that denotes the Pearson correlation between A1 & A2, B1 & B2, and C1 & C2. In this case, for example, those correlations are:

[1] 0.95 0.92 0.46


asked Aug 30 '13 14:08

Joe Ripberger

3 Answers

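cor accepts two data frames and returns the matrix of correlations between each column of the first and each column of the second. The pairwise correlations you want are on its diagonal: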

A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))

B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))

cor(A,B)

#           A2        B2       C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588

diag(cor(A,B))
#[1] 0.9481224 0.9190183 0.4595880
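As a quick check on the example data, the sketch below rebuilds the requested vector C from the diagonal. Note that diag() returns an unnamed vector here because the row and column names of the correlation matrix differ; the setNames() labels are only an illustrative choice and are not part of the original answer.

```r
A <- data.frame(A1 = c(1, 2, 3, 4, 5), B1 = c(6, 7, 8, 9, 10), C1 = c(11, 12, 13, 14, 15))
B <- data.frame(A2 = c(6, 7, 7, 10, 11), B2 = c(2, 1, 3, 8, 11), C2 = c(1, 5, 16, 7, 8))

# Pairing columns by position only makes sense if the shapes match.
stopifnot(nrow(A) == nrow(B), ncol(A) == ncol(B))

# Keep only the diagonal: cor(A1, A2), cor(B1, B2), cor(C1, C2).
C <- diag(cor(A, B))

# Label each value with the pair of columns it came from (illustrative labels).
C <- setNames(C, paste(names(A), names(B), sep = "_"))
print(round(C, 2))
```

Rounded to two decimals, this should reproduce the 0.95, 0.92 and 0.46 given in the question.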

Edit:

Here are some benchmarks:

Unit: microseconds
                   expr     min       lq   median       uq      max neval
        diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295  352.955   100
      mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260  375.087   100
 unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140   100

Edit2:

And some better benchmarks, using data frames with 10 rows and n columns (n from 2 to 4096):

set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

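[Figure: benchmark plot of log10(median runtime in µs) against log2(n) for diag(cor(A, B)), mapply(cor, A, B) and unlist(Map(cor, A, B)). Image: https://i.stack.imgur.com/ct8Vh.png]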

However, I should probably mention that these benchmarks strongly depend on the number of rows.

Edit3: Since I was asked for the benchmarking code, here it is.

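library(microbenchmark)

# For each n = 2, 4, ..., 4096 columns, time the three approaches
# and keep the median runtime in microseconds.
b <- sapply(2^(1:12), function(n) {
    set.seed(42)
    A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
    B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

    # summary() returns a data frame with one row per expression
    res <- summary(microbenchmark(
                     diag(cor(A,B)),
                     mapply(cor, A, B),
                     unlist(Map(cor,A,B)),
                     times=10
                   ),unit="us")
    res$median
})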

b <- t(b)

matplot(x=1:12,log10(b),type="l",
        ylab="log10(median [µs])", 
        xlab="log2(n)",col=1:3,lty=1)
legend("topleft", legend=c("diag(cor(A, B))", 
                           "mapply(cor, A, B)",
                           "unlist(Map(cor,A,B))"),lty=1, col=1:3)
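One reason the approaches can diverge as n grows is that diag(cor(A, B)) computes all n × n correlations and keeps only n of them. A minimal sketch of a vectorised alternative that computes only the paired correlations follows. The helper name colwise_cor is hypothetical (it is not part of base R or of the answers here), it was not included in the benchmarks above, and it assumes numeric columns, equal dimensions and no missing values.

```r
# Hypothetical helper: Pearson correlation between column i of A and column i of B.
colwise_cor <- function(A, B) {
  A <- as.matrix(A)
  B <- as.matrix(B)
  stopifnot(identical(dim(A), dim(B)))

  # Centre every column on its own mean.
  Ac <- sweep(A, 2, colMeans(A))
  Bc <- sweep(B, 2, colMeans(B))

  # r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)), evaluated column by column.
  colSums(Ac * Bc) / sqrt(colSums(Ac^2) * colSums(Bc^2))
}

set.seed(42)
n <- 8
A <- as.data.frame(matrix(rnorm(10 * n), ncol = n))
B <- as.data.frame(matrix(rnorm(10 * n), ncol = n))

res_vec <- colwise_cor(A, B)
res_ref <- mapply(cor, A, B)
print(all.equal(unname(res_vec), unname(res_ref)))
```

Unlike cor, this sketch does no NA handling, and a constant column produces NaN rather than NA with a warning, so it is only a starting point.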

answered Oct 21 '22 19:10

Roland

You can use Map, a relative of the apply family of functions, for that.

Map(function(x,y) cor(x,y),A,B)
$A1
[1] 0.9481224

$B1
[1] 0.9190183

$C1
[1] 0.459588

If you want the output as a vector, as suggested by @Jilber:

unlist(Map(function(x,y) cor(x,y),A,B))
       A1        B1        C1 
0.9481224 0.9190183 0.4595880

Or you can just use:

 unlist(Map(cor,A,B))
       A1        B1        C1 
0.9481224 0.9190183 0.459588
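If a plain numeric vector is wanted directly, a sketch using vapply over the column positions is another option. Unlike unlist(Map(...)), it stops with an error if any result is not a single number. This variant was not suggested in the original answers; A and B are the example data frames from the question.

```r
A <- data.frame(A1 = c(1, 2, 3, 4, 5), B1 = c(6, 7, 8, 9, 10), C1 = c(11, 12, 13, 14, 15))
B <- data.frame(A2 = c(6, 7, 7, 10, 11), B2 = c(2, 1, 3, 8, 11), C2 = c(1, 5, 16, 7, 8))

# Correlate column i of A with column i of B; numeric(1) enforces one number per pair.
C <- vapply(seq_along(A),
            function(i) cor(A[[i]], B[[i]]),
            numeric(1))

# Carry over the column names of A, as Map and mapply do.
names(C) <- names(A)
print(C)
```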

answered Oct 21 '22 18:10

Metrics

Another alternative is the mapply function:

> mapply(function(x,y) cor(x,y),A,B)
       A1        B1        C1 
0.9481224 0.9190183 0.4595880

Or just mapply(cor, A, B) as suggested by @Aaron.
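Extra arguments to cor can be forwarded through mapply's MoreArgs argument. The sketch below assumes, purely for illustration, a missing value in one column (the data in the question has none) and passes use = "complete.obs" so that each pair is computed from the rows where both values are present.

```r
A <- data.frame(A1 = c(1, 2, 3, 4, 5), B1 = c(6, 7, 8, 9, 10), C1 = c(11, 12, 13, 14, 15))
B <- data.frame(A2 = c(6, 7, 7, 10, 11), B2 = c(2, 1, 3, 8, 11), C2 = c(1, 5, 16, 7, 8))

# Hypothetical missing value, introduced only to show the behaviour.
A$A1[2] <- NA

# With cor's default use = "everything", the affected pair comes back as NA.
with_na <- mapply(cor, A, B)

# MoreArgs passes the same extra argument to every cor() call.
C <- mapply(cor, A, B, MoreArgs = list(use = "complete.obs"))
print(C)
```

The same MoreArgs list could carry method = "spearman" or any other argument that cor accepts.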

answered Oct 21 '22 19:10

Jilber Urbina
