
correlation by row, within data frame

I am attempting to calculate the correlation between all the rows of a large data frame, and so far have come up with a simple for-loop that works. For example:

name <- c("a", "b", "c", "d")
col1 <- c(43.78, 43.84, 37.92, 31.72)
col2 <- c(43.80, 43.40, 37.64, 31.62)
col3 <- c(43.14, 42.85, 37.54, 31.74)
df <- data.frame(name, col1, col2, col3)
cor.df <- data.frame(name1=NA, name2=NA,correl=NA)

for(i in 1: (nrow(df) - 1))  {
  for(j in (i+1): nrow(df) ) {
    v1 <- as.numeric( df[i, 2:ncol(df)] )
    v2 <- as.numeric( df[j, 2:ncol(df)] )
    correl <- cor(v1, v2)

    name1 <- df[i, "name"]
    name2 <- df[j, "name"]

    dftemp <- data.frame(name1, name2, correl)
    cor.df <- rbind(cor.df, dftemp)
  }
}
cor.df <- cor.df[-1, ]  # drop the NA placeholder row used to initialise cor.df

#    name1 name2     correl
#     a     b      0.8841255
#     a     c      0.6842705
#     a     d     -0.6491118
#     b     c      0.9457125
#     b     d     -0.2184630
#     c     d      0.1105508

Given the large data frame and the inefficient for-loop, the correlation computation takes a long time. Does anyone have suggestions for making it faster? Note that I have many data frames in a list, so I could use lapply, but I have not figured out how to write that line of code.

fragf asked Oct 30 '17 13:10


1 Answer

Drop the first column, transpose, and use the stats::cor function, which computes correlations between the columns of a matrix:

> cor(t(df[-1]))
           [,1]       [,2]      [,3]       [,4]
[1,]  1.0000000  0.8841255 0.6842705 -0.6491118
[2,]  0.8841255  1.0000000 0.9457125 -0.2184630
[3,]  0.6842705  0.9457125 1.0000000  0.1105508
[4,] -0.6491118 -0.2184630 0.1105508  1.0000000
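
The transpose matters because cor, given a matrix, correlates its columns rather than its rows. A quick check, assuming the df from the question, that entry [1, 2] of the result matches the correlation of rows a and b computed the way the original loop does it:

```r
# Data frame from the question
df <- data.frame(
  name = c("a", "b", "c", "d"),
  col1 = c(43.78, 43.84, 37.92, 31.72),
  col2 = c(43.80, 43.40, 37.64, 31.62),
  col3 = c(43.14, 42.85, 37.54, 31.74)
)

# After t(), each column holds one original row, so cor() compares rows
m <- cor(t(df[-1]))

# The same pair computed as in the question's loop
by_hand <- cor(as.numeric(df[1, -1]), as.numeric(df[2, -1]))

# Compare the two results up to floating-point tolerance
all.equal(m[1, 2], by_hand)
```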

# pretty output
x <- cor(t(df[, -1]))
x[upper.tri(x, diag = TRUE)] <- NA
rownames(x) <- colnames(x) <- df$name
x <- na.omit(reshape::melt(t(x)))
x <- x[ order(x$X1, x$X2), ]

#    X1 X2      value
# 5   a  b  0.8841255
# 9   a  c  0.6842705
# 13  a  d -0.6491118
# 10  b  c  0.9457125
# 14  b  d -0.2184630
# 15  c  d  0.1105508
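
For the list of data frames mentioned in the question, the same idea can be wrapped in a function and passed to lapply. The following is a minimal base-R sketch that avoids the reshape dependency. The names row_cor_pairs and df_list are hypothetical, and it assumes that in every data frame the first column holds the names and all remaining columns are numeric:

```r
# Hypothetical helper: long table of row-by-row correlations for one data frame.
# Assumes column 1 holds the names and every other column is numeric.
row_cor_pairs <- function(d) {
  m <- cor(t(as.matrix(d[-1])))

  # Index every pair (i < j) exactly once, ordered by first row then second row
  idx <- which(upper.tri(m), arr.ind = TRUE)
  idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

  nm <- as.character(d[[1]])
  data.frame(
    name1  = nm[idx[, "row"]],
    name2  = nm[idx[, "col"]],
    correl = m[idx],
    stringsAsFactors = FALSE
  )
}

# Data frame from the question
df <- data.frame(
  name = c("a", "b", "c", "d"),
  col1 = c(43.78, 43.84, 37.92, 31.72),
  col2 = c(43.80, 43.40, 37.64, 31.62),
  col3 = c(43.14, 42.85, 37.54, 31.74)
)

# Stand-in for the real list: two copies of the example data frame
df_list <- list(first = df, second = df)

# One result table per data frame in the list
results <- lapply(df_list, row_cor_pairs)
results$first
```

Each element of results has the same name1, name2, correl layout as the cor.df built by the loop in the question.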
amrrs answered Sep 19 '22 17:09
