I have a set of large dataframes that look like A and B:
A <- data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
A1 B1 C1
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
B <- data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
A2 B2 C2
1 6 2 1
2 7 1 5
3 7 3 16
4 10 8 7
5 11 11 8
I would like to create a vector (C) that denotes the Pearson correlation between A1 & A2, B1 & B2, and C1 & C2. In this case, for example, those correlations are:
[1] 0.95 0.92 0.46
corrwith() is used to compute pairwise correlation between rows or columns of two DataFrame objects. If the shape of two dataframe object is not same then the corresponding correlation value will be a NaN value. Note: The correlation of a variable with itself is 1.
Correlation Between Two Columns of DataFrame You can see the correlation between two columns of pandas DataFrame by using DataFrame. corr() function. The pandas. DataFrame.
The correlation between 2 variables is found with the cor() function. Note that the correlation between variables X and Y is equal to the correlation between variables Y and X so the order of the variables in the cor() function does not matter.
cor
accepts two data.frames:
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
diag(cor(A,B))
#[1] 0.9481224 0.9190183 0.4595880
Edit:
Here are some benchmarks:
Unit: microseconds
expr min lq median uq max neval
diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295 352.955 100
mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260 375.087 100
unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140 100
Edit2:
And some better benchmarks using
set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))
However, I should probably mention that these benchmarks strongly depend on the number of rows.
Edit3: Since I was asked for the benchmarking code, here it is.
b <- sapply(2^(1:12), function(n) {
set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))
require(microbenchmark)
res <- print(microbenchmark(
diag(cor(A,B)),
mapply(cor, A, B),
unlist(Map(cor,A,B)),
times=10
),unit="us")
res$median
})
b <- t(b)
matplot(x=1:12,log10(b),type="l",
ylab="log10(median [µs])",
xlab="log2(n)",col=1:3,lty=1)
legend("topleft", legend=c("diag(cor(A, B))",
"mapply(cor, A, B)",
"unlist(Map(cor,A,B))"),lty=1, col=1:3)
You can use friend of apply functions, Map
, for that.
Map(function(x,y) cor(x,y),A,B)
$A1
[1] 0.9481224
$B1
[1] 0.9190183
$C1
[1] 0.459588
If you want the output as vector as suggested by @Jilber :
unlist(Map(function(x,y) cor(x,y),A,B))
A1 B1 C1
0.9481224 0.9190183 0.4595880
Or you can just use:
unlist(Map(cor,A,B))
A1 B1 C1
0.9481224 0.9190183 0.459588
Another alternative you can use mapply
function
> mapply(function(x,y) cor(x,y),A,B)
A1 B1 C1
0.9481224 0.9190183 0.4595880
Or just mapply(cor, A, B)
as suggested by @Aaron.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With