
Calculating correlation in data frame in R

Tags: dataframe, r

I have a data frame d with 3 columns, s, n and id, and I need to calculate the correlation between "s" and "n" separately for each "id". For example, given this data frame:

"s"   "n"   "id"
1.6    0.5   2
2.5    0.8   2
4.8    0.7   3
2.6    0.4   3
3.5    0.66  3
1.2    0.1   4
2.5    0.45  4

So I want to calculate the correlation for the 2's, 3's and 4's and return it as a vector like:

cor
0.18 0.45 0.65

My problem is how to select the rows for each id, calculate the correlation, and return the results as a vector.

Thank you

Raja Raghudeep Emani asked Mar 18 '23 14:03



2 Answers

Here's a dplyr approach:

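library(dplyr)
# df is the data frame from the question (called d there);
# group by id, compute cor(s, n) per group, return the column as a vector
group_by(df, id) %>% summarise(corel = cor(s, n)) %>% pull(corel)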
#[1] 1.000000 0.875128 1.000000
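
A note on the output above: ids 2 and 4 each have only two rows in the sample data, and the Pearson correlation of two points that differ in both variables is always 1 or -1, so only the value for id 3 really depends on how the data are scattered. A minimal sketch in base R, with the values copied from the question (the names r2, r4 and r_flipped are just for illustration):

```r
# id 2 has two rows: (1.6, 0.5) and (2.5, 0.8)
r2 <- cor(c(1.6, 2.5), c(0.5, 0.8))

# id 4 has two rows: (1.2, 0.1) and (2.5, 0.45)
r4 <- cor(c(1.2, 2.5), c(0.1, 0.45))

# Swapping the two n values makes the line through the points slope downward
r_flipped <- cor(c(1.6, 2.5), c(0.8, 0.5))

print(c(r2, r4, r_flipped))
```

Both r2 and r4 come out as 1 and r_flipped as -1 (up to floating-point rounding), so with real data you would probably want more than two rows per id before reading much into the result.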
talat answered Mar 20 '23 14:03



# Split the data frame into a list with one subset per id
tab_split <- split(mydf, mydf$id)

# Compute the correlation within each subset and collect the results in a vector
unlist(lapply(tab_split, function(tab) cor(tab[, 1], tab[, 2])))

With the sample you gave:

mydf<-structure(list(s = c(1.6, 2.5, 4.8, 2.6, 3.5, 1.2, 2.5), 
                     n = c(0.5,0.8, 0.7, 0.4, 0.66, 0.1, 0.45), 
                     id = c(2L, 2L, 3L, 3L, 3L, 4L,4L)), 
                .Names = c("s", "n", "id"), 
                class = "data.frame", 
                row.names = c(NA, -7L))

> unlist(lapply(tab_split,function(tab) cor(tab[,1],tab[,2])))
       2        3        4 
1.000000 0.875128 1.000000

NB: if your column names are always "n" and "s", you can also refer to the columns by name instead of by position:

unlist(lapply(tab_split, function(tab) cor(tab$s, tab$n)))
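
If you want a plain vector like the one in the question, the same split-and-apply idea can also be written with vapply(), which checks that every group returns exactly one number; unname() then drops the id labels. This is only a sketch in base R using the sample data from the question, and the names cors and g are my own:

```r
# Sample data from the question
mydf <- data.frame(
  s  = c(1.6, 2.5, 4.8, 2.6, 3.5, 1.2, 2.5),
  n  = c(0.5, 0.8, 0.7, 0.4, 0.66, 0.1, 0.45),
  id = c(2L, 2L, 3L, 3L, 3L, 4L, 4L)
)

# One correlation per id; numeric(1) declares that each group yields a single number
cors <- vapply(split(mydf, mydf$id), function(g) cor(g$s, g$n), numeric(1))

print(cors)          # named vector, names are the ids
print(unname(cors))  # same values without the id names
```

Keeping the names is usually handy, since they tell you which id each correlation belongs to; drop them only if the downstream code needs a bare vector.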
Cath answered Mar 20 '23 16:03
