Making a better summary statistics table with plyr in R

Tags: r, plyr

Every time I get a new data set, the first thing I do is check the summary statistics. The summary function does a pretty good job, but I'm frequently interested in standard deviations, quantiles with different breakpoints, number of observations, etc. Also, the way summary presents its output isn't the easiest to digest, nor is it what you see in journals (i.e., summary lays the variables out horizontally instead of vertically).

For example, here is what I get from summary with some made-up data.

> library(plyr)
> library(reshape2)
> my.data <- data.frame(firm = factor(rep(letters[1:5], each = 5)), returns = rnorm(n = 5 * 5), leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + .... [TRUNCATED] 
> my.summary <- summary(my.data)
> my.summary
 firm     returns           leverage     
 a:5   Min.   :-1.6765   Min.   :0.2863  
 b:5   1st Qu.:-0.6945   1st Qu.:0.3929  
 c:5   Median :-0.1930   Median :0.5061  
 d:5   Mean   :-0.1159   Mean   :0.5009  
 e:5   3rd Qu.: 0.4323   3rd Qu.:0.6011  
       Max.   : 1.1915   Max.   :0.7093  

But let's say I really want something more like this.

> my.manual.summary <- data.frame(mean = c(mean(my.data$returns), mean(my.data$leverage)), median = c(median(my.data$returns), median(my.data$leverage .... [TRUNCATED] 
> rownames(my.manual.summary) <- c("returns", "leverage")
> my.manual.summary
               mean     median        sd
returns  -0.1158633 -0.1929571 0.6996548
leverage  0.5008895  0.5061301 0.1453381

For this small data set (i.e., just a few firm characteristics) this is easy. But if I have more variables, or want to compute more statistics or do more slicing and dicing, it can get tedious.

I tried this with reshape2 and plyr, but I get an error.

> my.melted.data <- melt(my.data)
Using firm as id variables
> my.improved.summary <- ddply(my.melted.data[, -1], .(variable), c("mean", "median", "sd"), na.rm = T)
Error in proto[[i]] <- fs[[i]](x, ...) : 
  more elements supplied than there are to replace
In addition: Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
3: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
4: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA

This leaves me with two questions:

  1. What am I doing wrong with ddply?
  2. Am I re-inventing the wheel here? Given that this is table 1 in everything I read and write, is there an existing solution that I haven't found?

Thanks!

asked Apr 07 '11 16:04

Richard Herron


1 Answer

Try stat.desc from the pastecs package. You can apply it to your data set by calling stat.desc(my.data). To get the output in the format you want, you need to (a) transpose the resulting data frame, (b) remove the non-numeric variables, and (c) retain only the summary statistics columns you require.
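
Steps (a) to (c) might look roughly like the sketch below, assuming pastecs is installed. The data are made up for illustration (the data-generating line in the question is truncated, so this is not the same data), and the choice of mean, median, and std.dev simply mirrors the three columns of the hand-built table in the question.

```r
library(pastecs)

# Hypothetical stand-in data; NOT the (truncated) data from the question.
set.seed(42)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(25),
  leverage = runif(25, min = 0.2, max = 0.8)
)

# (b) keep only the numeric columns before summarising
numeric.data <- my.data[sapply(my.data, is.numeric)]

# stat.desc() returns one row per statistic and one column per variable
full.stats <- stat.desc(numeric.data)

# (a) transpose so that variables become rows, then
# (c) keep only the statistics of interest
my.summary.table <- t(full.stats)[, c("mean", "median", "std.dev")]
my.summary.table
```

The result is a matrix with one row per variable, so adding another statistic that stat.desc reports is just a matter of adding its name to the column selection.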

answered Sep 21 '22 07:09

Ramnath
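
On the first question (what went wrong with ddply): as far as I can tell, when .fun is a character vector, plyr wraps the named functions with each() and calls every one of them on the whole data-frame piece, not on its value column. That would explain both the "argument is not numeric or logical" warnings and the length mismatch in the error. One possible way around it, sketched here on made-up data rather than the question's truncated data, is to pass summarise and name the value column explicitly:

```r
library(plyr)
library(reshape2)

# Hypothetical stand-in data; NOT the (truncated) data from the question.
set.seed(42)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(25),
  leverage = runif(25, min = 0.2, max = 0.8)
)

# Long format: one row per (firm, variable) pair, values in column "value"
my.melted.data <- melt(my.data, id.vars = "firm")

# summarise evaluates each expression within a piece, so the statistics
# are computed on the numeric "value" column only
my.improved.summary <- ddply(
  my.melted.data, .(variable), summarise,
  mean   = mean(value, na.rm = TRUE),
  median = median(value, na.rm = TRUE),
  sd     = sd(value, na.rm = TRUE)
)
my.improved.summary
```

Grouping by additional columns, for example .(firm, variable), should give the same statistics per firm, which covers the "slicing and dicing" case.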