
Having trouble using the plyr package and working with lists

Tags: r, plyr

I'm having trouble understanding how to use the plyr package. I am trying to use it to split up dataframes that are stored in a list, apply a function, store the results as dataframes, and combine those dataframes into a list again. So, given the following data:

    #create test dfs
    df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
    df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
    df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))

    #make them a list
    list.1<-list(df1=df1,df2=df2,df3=df3)

I would like to calculate the mean of each group defined in d, for each dataframe. For a single dataframe and a single column, the group means can be calculated with plyr like this:

    library(plyr)
    ddply(df1,.(d),summarise, mean=mean(a))

But how do I apply it to every column within the dataframe and to every dataframe within the list? And how can I reassemble all the data so that in the end I get a list of matrices containing the results? Sorry for this very basic question, but I'm new to R and I have been trying to solve this for quite some time. Thanks.

Joschi asked Jan 21 '13 13:01


1 Answer

You need to put all the data into one big data.frame:

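    library(plyr)     # ldply
    library(reshape)  # melt

    # melt each dataframe to long format; ldply stacks them and adds an .id column
    big_dataframe = ldply(list.1, function(x) melt(x, id.vars = "d"))
    head(big_dataframe)
    #   .id d variable value
    # 1 df1 a        a    44
    # 2 df1 b        a    17
    # 3 df1 c        a    15
    # 4 df1 a        a    30
    # 5 df1 a        a    49
    # 6 df1 b        a    33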

...and then use ddply on it.

    res = ddply(big_dataframe, .(.id, d, variable), summarise, mn = mean(value))
    res
    #    .id d variable       mn
    # 1  df1 a        a 40.00000
    # 2  df1 a        b 25.25000
    # 3  df1 a        c 31.25000
    # 4  df1 b        a 22.66667
    # 5  df1 b        b 16.00000
    # 6  df1 b        c 26.00000
    # 7  df1 c        a  9.00000
    # 8  df1 c        b 16.50000
    # 9  df1 c        c 15.00000
    # 10 df1 d        a 28.00000
    # 11 df1 d        b 24.00000
    # 12 df1 d        c 39.00000
    # 13 df2 e        a 18.50000
    # 14 df2 e        b 15.50000
    # 15 df2 e        c 16.50000
    # 16 df2 f        a 26.33333
    # 17 df2 f        b 42.00000
    # 18 df2 f        c 37.00000
    # 19 df2 g        a 26.50000
    # 20 df2 g        b 22.00000
    # 21 df2 g        c 31.00000
    # 22 df3 h        a 29.25000
    # 23 df3 h        b 34.25000
    # 24 df3 h        c 32.00000
    # 25 df3 i        a 30.33333
    # 26 df3 i        b 40.00000
    # 27 df3 i        c 24.33333
    # 28 df3 j        a 21.00000
    # 29 df3 j        b  5.00000
    # 30 df3 j        c 46.00000

This gives the mean of each variable (a-c), per level of factor d, and per sub-dataframe (df1-df3).
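If the result really has to be a list of matrices, as the question asks, one possible route is to skip the single long dataframe and work on each list element directly. The sketch below is only one way to do it, not necessarily the best one. It assumes plyr is installed and uses `llply`, `ddply` and `numcolwise` from that package. The names `means.list` and `mat.list` and the `set.seed(1)` call are made up for illustration and are not part of the question.

```r
library(plyr)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

# test data with the same layout as in the question
df1 <- data.frame(a = sample(1:50, 10), b = sample(1:50, 10), c = sample(1:50, 10),
                  d = c("a", "b", "c", "a", "a", "b", "b", "a", "c", "d"))
df2 <- data.frame(a = sample(1:50, 9), b = sample(1:50, 9), c = sample(1:50, 9),
                  d = c("e", "f", "g", "e", "e", "f", "f", "e", "g"))
df3 <- data.frame(a = sample(1:50, 8), b = sample(1:50, 8), c = sample(1:50, 8),
                  d = c("h", "i", "j", "h", "h", "i", "i", "h"))
list.1 <- list(df1 = df1, df2 = df2, df3 = df3)

# for every dataframe in the list: split by d and take the mean of
# every numeric column (numcolwise skips the non-numeric column d)
means.list <- llply(list.1, function(df) ddply(df, .(d), numcolwise(mean)))

# turn each dataframe of means into a numeric matrix,
# using the group labels from d as row names
mat.list <- llply(means.list, function(m) {
  out <- as.matrix(m[, setdiff(names(m), "d"), drop = FALSE])
  rownames(out) <- as.character(m$d)
  out
})

str(mat.list)
```

`mat.list` keeps the names of the input list (df1, df2, df3), so `mat.list$df1["a", "b"]` should be the mean of column b for group a in df1. Without plyr, something like `lapply(list.1, function(df) aggregate(. ~ d, data = df, FUN = mean))` in base R should give the same per-group means as a list of dataframes.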

Paul Hiemstra answered Nov 15 '22 08:11