How can I generate by-group summary statistics if my grouping variable is a factor?

Suppose I want to get some summary statistics on the mtcars dataset (it ships with base R; I'm using version 2.12.1). Below, I group the cars by their number of engine cylinders and take the per-group means of the remaining variables in mtcars, using ddply() from the plyr package.

> str(mtcars)
'data.frame': 32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> library(plyr)
> ddply(mtcars, .(cyl), mean)
       mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear
1 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909
2 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143
3 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714
      carb
1 1.545455
2 3.428571
3 3.500000

But if my grouping variable happens to be a factor, things get trickier. ddply() throws a warning for each level of the factor, since one can't take the mean() of a factor, and the cyl column comes back as NA.

> mtcars$cyl <- as.factor(mtcars$cyl)
> str(mtcars)
'data.frame': 32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> ddply(mtcars, .(cyl), mean)
       mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear
1 26.66364  NA 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909
2 19.74286  NA 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143
3 15.10000  NA 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714
      carb
1 1.545455
2 3.428571
3 3.500000
Warning messages:
1: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA
>
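
The warning can be reproduced in isolation, outside ddply(). The following is a minimal sketch in base R; the vector name cyl_factor is made up for illustration and is not part of mtcars.

```r
# A small factor standing in for the cyl column (illustrative values)
cyl_factor <- factor(c(6, 6, 4, 8))

# mean() is not defined for factors: it warns and returns NA.
# suppressWarnings() is used here only to keep the output quiet.
result <- suppressWarnings(mean(cyl_factor))
print(is.na(result))

# Getting the numbers back has to go through the labels,
# because as.numeric() on a factor returns the internal integer codes.
print(mean(as.numeric(as.character(cyl_factor))))
```

This is the same mean.default() code path that produces the three warnings above, one per group.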

So, I'm wondering if I'm just going about generating summary statistics the wrong way.

How does one usually generate data structures of by-factor or by-group summary statistics (like means, standard deviations, etc.)? Should I be using something other than ddply()? If I can use ddply(), what can I do to avoid the warnings that result from trying to take the mean of my grouping factor?

416

asked Jan 29 '11 03:01

briandk

2 Answers

Use numcolwise(mean). plyr's numcolwise() takes a function and returns a version of it that operates only on the numeric columns of a data frame, ignoring categorical/factor columns.

  > ddply(mtcars, .(cyl), numcolwise(mean))

      cyl      mpg     disp        hp     drat       wt     qsec        vs
    1   4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
    2   6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
    3   8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
             am     gear     carb
    1 0.7272727 4.090909 1.545455
    2 0.4285714 3.857143 3.428571
    3 0.1428571 3.285714 3.500000
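
For anyone without plyr installed, the same idea can be approximated in base R: keep only the numeric columns, then summarise per group. This is a rough sketch of the concept, not how numcolwise() is implemented; the names cars, is_num, and group_means are made up for the example.

```r
# Work on a copy so the built-in mtcars is left untouched
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# TRUE for numeric columns, FALSE for the factor cyl
is_num <- vapply(cars, is.numeric, logical(1))

# Per-group means of the numeric columns only, grouped by the factor
group_means <- aggregate(cars[is_num], by = list(cyl = cars$cyl), FUN = mean)
print(group_means)
```

Because the columns are chosen by type rather than by position, this keeps working if more factor columns are added to the data frame.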

174

answered Sep 27 '22 20:09

Prasad Chalasani

Not an answer, but an observation: this is not an issue with ddply() per se. With cyl still numeric, both of the following produce a table of means:

aggregate(mtcars, by=list(mtcars$cyl), mean)
apply(mtcars, 2, function(col) tapply(col, INDEX=mtcars$cyl, FUN=mean))

But after mtcars$cyl <- as.factor(mtcars$cyl), neither of the above works, because R doesn't know how to take the mean of a factor column. We can avoid the problem by dropping that column ("cyl" is column 2) from the data passed to mean():

aggregate(mtcars[ , -2], by=list(mtcars$cyl), mean)
apply(mtcars[ , -2], 2, function(col) tapply(col, INDEX=mtcars$cyl, FUN=mean))

But that's pretty clunky.
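
A possibly less clunky variant avoids the hard-coded column index. The sketch below uses only base R and assumes cyl is the only non-numeric column; the object names (cars, means_by_cyl, mpg_stats, means_matrix) are made up for the example.

```r
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Formula interface: "." stands for every column other than cyl
means_by_cyl <- aggregate(. ~ cyl, data = cars, FUN = mean)
print(means_by_cyl)

# Several statistics at once for a single variable
mpg_stats <- aggregate(mpg ~ cyl, data = cars,
                       FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
print(mpg_stats)

# tapply() version: sapply() over the data frame columns avoids the
# matrix coercion that apply() performs, and columns are picked by type
num_cols <- vapply(cars, is.numeric, logical(1))
means_matrix <- sapply(cars[num_cols], function(col) tapply(col, cars$cyl, mean))
print(means_matrix)
```

The formula form returns a data frame with cyl as its first column, while the sapply()/tapply() form returns a matrix with one row per factor level.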

answered Sep 27 '22 21:09

J. Win.

Tags: r, reshape, apply, plyr
