I am attempting to calculate the correlation between all the rows of a large data frame, and so far have come up with a simple for-loop that works. For example:
name <- c("a", "b", "c", "d")
col1 <- c(43.78, 43.84, 37.92, 31.72)
col2 <- c(43.80, 43.40, 37.64, 31.62)
col3 <- c(43.14, 42.85, 37.54, 31.74)
df <- data.frame(name, col1, col2, col3)
cor.df <- data.frame(name1 = NA, name2 = NA, correl = NA)
for (i in 1:(nrow(df) - 1)) {
  for (j in (i + 1):nrow(df)) {
    v1 <- as.numeric(df[i, 2:ncol(df)])
    v2 <- as.numeric(df[j, 2:ncol(df)])
    correl <- cor(v1, v2)
    name1 <- df[i, "name"]
    name2 <- df[j, "name"]
    dftemp <- data.frame(name1, name2, correl)
    cor.df <- rbind(cor.df, dftemp)
  }
}
na.omit(cor.df)
# name1 name2     correl
# a     b      0.8841255
# a     c      0.6842705
# a     d     -0.6491118
# b     c      0.9457125
# b     d     -0.2184630
# c     d      0.1105508
Given the large data frame and the inefficient for-loop, the correlation computation takes a long time. Does anyone have suggestions for making it faster? Note that I have many data frames in a list, so I could use lapply, but I have not figured out how to write that line of code.
In pandas, DataFrame.corr() computes the pairwise correlation of all columns in a data frame. NA values are excluded automatically. Non-numeric columns were ignored by default in older pandas versions; since pandas 2.0 you need to pass numeric_only=True to skip them. In the resulting matrix, each cell holds the correlation between the variable of its row and the variable of its column, so the diagonal is 1 because every variable correlates perfectly with itself.
A correlation matrix investigates the dependence between multiple variables at the same time. It shows symmetric tabular data where each row and column represent a variable, and the corresponding value is the correlation coefficient denoting the strength of a relationship between these two variables.
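As a rough illustration in R, the sketch below checks the two properties just described (symmetry and a unit diagonal). It assumes the three numeric columns from the question are treated as the variables; the name vals is made up for this example.

```r
# Numeric columns from the question; each column is one variable
vals <- cbind(col1 = c(43.78, 43.84, 37.92, 31.72),
              col2 = c(43.80, 43.40, 37.64, 31.62),
              col3 = c(43.14, 42.85, 37.54, 31.74))

# 3 x 3 matrix: one row and one column per variable
m <- cor(vals)

# cor(x, y) equals cor(y, x), so the matrix is symmetric
sym <- isSymmetric(m)

# Every variable correlates perfectly with itself
unit_diag <- isTRUE(all.equal(unname(diag(m)), rep(1, ncol(vals))))

print(m)
print(c(symmetric = sym, unit_diagonal = unit_diag))
```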
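Drop the first column, transpose, and use the cor function (it lives in the stats package, which is loaded by default). cor correlates the columns of a matrix, so transposing makes it correlate the original rows: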
> cor(t(df[-1]))
           [,1]       [,2]      [,3]       [,4]
[1,]  1.0000000  0.8841255 0.6842705 -0.6491118
[2,]  0.8841255  1.0000000 0.9457125 -0.2184630
[3,]  0.6842705  0.9457125 1.0000000  0.1105508
[4,] -0.6491118 -0.2184630 0.1105508  1.0000000
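# pretty output (requires the reshape package to be installed)
x <- cor(t(df[, -1]))
# blank out the diagonal and the duplicated half of the symmetric matrix
x[upper.tri(x, diag = TRUE)] <- NA
rownames(x) <- colnames(x) <- df$name
# long format: one row per remaining pair, NA cells dropped
x <- na.omit(reshape::melt(t(x)))
x <- x[order(x$X1, x$X2), ]
x
#    X1 X2      value
# 5   a  b  0.8841255
# 9   a  c  0.6842705
# 13  a  d -0.6491118
# 10  b  c  0.9457125
# 14  b  d -0.2184630
# 15  c  d  0.1105508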
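The question also mentions a list of data frames. The following is a minimal sketch of how the same idea could be wrapped in a function and passed to lapply, using base R only instead of reshape::melt. The helper name row_cor_pairs and the list dfs are hypothetical, and the sketch assumes every data frame keeps its names in the first column with only numeric columns after it.

```r
# Hypothetical helper: pairwise row correlations for one data frame.
# Assumes column 1 holds the row names and all other columns are numeric.
row_cor_pairs <- function(d) {
  # Transpose so that rows become columns, which is what cor() compares
  m <- cor(t(as.matrix(d[-1])))
  # Index pairs (i, j) with i < j, so each pair appears exactly once
  idx <- which(upper.tri(m), arr.ind = TRUE)
  out <- data.frame(name1 = d[[1]][idx[, "row"]],
                    name2 = d[[1]][idx[, "col"]],
                    correl = m[idx])
  out <- out[order(out$name1, out$name2), ]
  rownames(out) <- NULL
  out
}

# Example data from the question
df <- data.frame(name = c("a", "b", "c", "d"),
                 col1 = c(43.78, 43.84, 37.92, 31.72),
                 col2 = c(43.80, 43.40, 37.64, 31.62),
                 col3 = c(43.14, 42.85, 37.54, 31.74))

# Stand-in for the real list of data frames (here the same one twice)
dfs <- list(first = df, second = df)

# One result table per data frame in the list
results <- lapply(dfs, row_cor_pairs)
print(results$first)
```

Because cor works on the whole matrix at once, this avoids both the nested loop and the repeated rbind calls, which are usually the slow parts of the original approach.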