
Correlation of subsets of dataframe using aggregate

I have a data frame made by row-binding many data frames, each identified by a unique key. I want to calculate the correlation coefficient between two columns within each subset (defined by the unique key) of the big data frame. For example, using the mtcars data, I might want to calculate the correlation between columns hp and wt for each unique value in column cyl. I could do it in a loop:

data("mtcars")
for (i in c(4, 6, 8)) {
  temp <- subset(mtcars, cyl == i)
  # print() is needed: values are not auto-printed inside a for loop
  print(cor(temp$hp, temp$wt))
}

I think aggregate would be better, but this code doesn't work:

data("mtcars")
aggregate(mtcars,by=mtcars$cyl,cor)
Alex asked Dec 04 '22 11:12



1 Answer

You could use

 data("mtcars")
 library(plyr)
 ddply(mtcars, "cyl", function(x) cor(x$hp, x$wt))

This splits mtcars by cyl, applies the function cor(x$hp, x$wt) to each subset x, and then combines the per-subset results into a single data.frame.
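The same split-apply-combine idea can also be sketched in base R, without extra packages. This is only a minimal illustration, not part of the original answer; the names by_cyl and cors are made up for the example. It also hints at why the aggregate attempt fails: aggregate expects by to be a list and applies the function to each column separately, so a two-column function like cor(hp, wt) does not fit it directly.

```r
data("mtcars")

# Split mtcars into a list of data frames, one per value of cyl
# (by_cyl is a hypothetical name used only for this sketch)
by_cyl <- split(mtcars, mtcars$cyl)

# Apply cor() to hp and wt within each piece; sapply() collapses the
# results into a numeric vector named by the cyl values
cors <- sapply(by_cyl, function(x) cor(x$hp, x$wt))
cors
```

The result is a named vector rather than a data frame; wrapping it in data.frame(cyl = names(cors), cor = cors) would give a shape closer to the ddply output.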

I can highly recommend the plyr package. It's one of the packages I use most in R.


Edit: As requested, here is a dplyr version. I have to say that I am not a big dplyr user, but the code should be OK.

library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp, wt))
cryo111 answered Dec 16 '22 17:12
