Suppose I wanted to get some summary statistics on the dataset mtcars (part of base R version 2.12.1).
Below, I use ddply() from the plyr package to group the cars according to the number of engine cylinders they have and take the per-group means of the remaining variables in mtcars.
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> ddply(mtcars, .(cyl), mean)
mpg cyl disp hp drat wt qsec vs am gear
1 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909
2 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143
3 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714
carb
1 1.545455
2 3.428571
3 3.500000
But if my grouping variable happens to be a factor, things get trickier. ddply() throws a warning for each level of the factor, since one can't take the mean() of a factor.
> mtcars$cyl <- as.factor(mtcars$cyl)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> ddply(mtcars, .(cyl), mean)
mpg cyl disp hp drat wt qsec vs am gear
1 26.66364 NA 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909
2 19.74286 NA 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143
3 15.10000 NA 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714
carb
1 1.545455
2 3.428571
3 3.500000
Warning messages:
1: In mean.default(X[[2L]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[2L]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[2L]], ...) :
argument is not numeric or logical: returning NA
>
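The warning can be reproduced without ddply() at all. A minimal sketch, using a small made-up factor (cyl_factor and the other variable names are hypothetical, not part of mtcars):

```r
# Made-up factor standing in for a grouping column such as mtcars$cyl
cyl_factor <- factor(c(6, 6, 4, 8))

# mean() has no method for factors, so mean.default() warns and returns NA;
# suppressWarnings() only hides the warning so the result can be inspected
factor_mean <- suppressWarnings(mean(cyl_factor))
is.na(factor_mean)

# Converting the labels back to numbers gives an ordinary numeric mean
numeric_mean <- mean(as.numeric(as.character(cyl_factor)))
```

This suggests the NA column comes from mean() being handed the factor itself, not from the grouping step.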
So, I'm wondering if I'm just going about generating summary statistics the wrong way.
How does one usually generate data structures of by-factor or by-group summary statistics (like means, standard deviations, etc.)? Should I be using something other than ddply()? If I can use ddply(), what can I do to avoid the warnings that result when trying to take the mean of my grouping factor?
Use numcolwise(mean): the numcolwise function converts its argument (a function) into a function that operates only on numerical columns (and ignores the categorical/factor columns).
> ddply(mtcars, .(cyl), numcolwise(mean))
cyl mpg disp hp drat wt qsec vs
1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909
2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
am gear carb
1 0.7272727 4.090909 1.545455
2 0.4285714 3.857143 3.428571
3 0.1428571 3.285714 3.500000
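For intuition, the idea behind numcolwise() can be sketched in base R. This is not plyr's implementation: numeric_cols_only, cars, by_cyl and means are hypothetical names, and the grouping is done with split() rather than ddply().

```r
# Hypothetical helper: wrap a summary function so it only sees numeric columns
numeric_cols_only <- function(f) {
  function(df, ...) {
    is_num <- vapply(df, is.numeric, logical(1))
    # Apply f to each numeric column and return a one-row data frame
    as.data.frame(lapply(df[is_num], f, ...))
  }
}

# Work on a copy so the built-in mtcars is left untouched
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Split by the factor, summarise each piece, and stack the one-row results
by_cyl <- split(cars, cars$cyl)
means <- do.call(rbind, lapply(by_cyl, numeric_cols_only(mean)))
```

Because cyl is a factor in cars, the wrapped function skips it, so no warning is raised. The group labels end up in the row names of means. Passing sd instead of mean should give per-group standard deviations in the same shape.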
Not an answer here, but an observation: this is not an issue of ddply() per se. The following both work fine to produce a table of means:
aggregate(mtcars, by=list(mtcars$cyl), mean)
apply(mtcars, 2, function(col) tapply(col, INDEX=mtcars$cyl, FUN=mean))
But after mtcars$cyl <- as.factor(mtcars$cyl) neither of the above works, because R doesn't know how to take the mean of a column of factors. We can avoid the problem by removing that column ("cyl" is column 2) from the things passed to mean():
aggregate(mtcars[ , -2], by=list(mtcars$cyl), mean)
apply(mtcars[ , -2], 2, function(col) tapply(col, INDEX=mtcars$cyl, FUN=mean))
But that's pretty clunky.
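One way to avoid the hard-coded column index, offered as a sketch and not the definitive approach, is the formula interface of aggregate(). There the dot stands for every column other than the grouping variable. The names cars, group_means and group_sds are hypothetical.

```r
# Work on a copy so the built-in mtcars is left untouched
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# "." means "all other columns"; cyl is used only for grouping,
# so mean() is never applied to the factor itself
group_means <- aggregate(. ~ cyl, data = cars, FUN = mean)

# The same call with a different summary function, here standard deviation
group_sds <- aggregate(. ~ cyl, data = cars, FUN = sd)
```

Note that the formula interface drops rows with missing values by default, which makes no difference for mtcars but may matter for other data.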