
Making a better summary statistics table with plyr in R

Tags: r, plyr

Every time I get a new data set, the first thing I do is check the summary statistics. The summary function does a pretty good job, but I'm frequently interested in standard deviations, quantiles with different breakpoints, number of observations, etc. Also, the output of summary isn't the easiest to digest, and it isn't laid out the way tables are in journals (i.e., summary is horizontal instead of vertical).

For example, here is what I get from summary with some made-up data.

> library(plyr)
> library(reshape2)
> my.data <- data.frame(firm = factor(rep(letters[1:5], each = 5)), returns = rnorm(n = 5 * 5), leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + .... [TRUNCATED] 
> my.summary <- summary(my.data)
> my.summary
 firm     returns           leverage     
 a:5   Min.   :-1.6765   Min.   :0.2863  
 b:5   1st Qu.:-0.6945   1st Qu.:0.3929  
 c:5   Median :-0.1930   Median :0.5061  
 d:5   Mean   :-0.1159   Mean   :0.5009  
 e:5   3rd Qu.: 0.4323   3rd Qu.:0.6011  
       Max.   : 1.1915   Max.   :0.7093

But let's say I really want something more like this.

> my.manual.summary <- data.frame(mean = c(mean(my.data$returns), mean(my.data$leverage)), median = c(median(my.data$returns), median(my.data$leverage .... [TRUNCATED] 
> rownames(my.manual.summary) <- c("returns", "leverage")
> my.manual.summary
               mean     median        sd
returns  -0.1158633 -0.1929571 0.6996548
leverage  0.5008895  0.5061301 0.1453381

For this small data set (i.e., just a few firm characteristics) this is easy. But if I have more variables, or want to compute more statistics or do more slicing and dicing, it can get tedious.

I tried this with reshape2 and plyr, but I get an error.

> my.melted.data <- melt(my.data)
Using firm as id variables
> my.improved.summary <- ddply(my.melted.data[, -1], .(variable), c("mean", "median", "sd"), na.rm = T)
Error in proto[[i]] <- fs[[i]](x, ...) : 
  more elements supplied than there are to replace
In addition: Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
3: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
4: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA

This leaves me with two questions:

1. What am I doing wrong with ddply?
2. Am I re-inventing the wheel here? Given that this is table 1 in everything I read and write, is there an existing solution that I haven't found?

Thanks!

asked Apr 07 '11 16:04

Richard Herron

1 Answer

Try the stat.desc function in the pastecs package. You can use it on your data set by calling stat.desc(my.data). To get the output in the format you desire, you need to (a) transpose the data frame, (b) remove non-numeric variables, and (c) retain only the summary statistics columns you require.
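
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the noise term added to leverage is an assumption (the original expression is truncated in the question), and the names numeric.cols, full.stats and my.table are made up for this example. Note that stat.desc labels the standard deviation "std.dev".

```r
library(pastecs)

# Made-up data in the spirit of the question. The rnorm() noise added to
# leverage is an assumption, because the original expression is truncated.
set.seed(42)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(n = 5 * 5, sd = 0.01)
)

# (b) Keep only the numeric columns before calling stat.desc.
numeric.cols <- vapply(my.data, is.numeric, logical(1))
full.stats <- stat.desc(my.data[, numeric.cols, drop = FALSE])

# (a) Transpose so that each variable is a row and each statistic a column.
# (c) Keep only the statistics of interest.
my.table <- as.data.frame(t(full.stats))[, c("mean", "median", "std.dev")]
print(my.table)
```

Other statistics that stat.desc reports by default, such as nbr.val (number of values), min, max and SE.mean, can be added to the column selection in the same way.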

answered Sep 21 '22 07:09

Ramnath
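
On the first question, a likely explanation is that ddply hands each per-variable piece to the supplied functions as a whole data frame, so mean, median and sd are applied to a data frame rather than to the value column. That would be consistent with the "argument is not numeric or logical" warnings. A minimal sketch of one way around it, using plyr's summarise to name the column explicitly, is shown here. The noise term in leverage is again an assumption, since the original expression is truncated, and the extra n column is only there for illustration.

```r
library(plyr)
library(reshape2)

# Made-up data in the spirit of the question. The rnorm() noise added to
# leverage is an assumption, because the original expression is truncated.
set.seed(42)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(n = 5 * 5, sd = 0.01)
)

# Long format: one row per firm/variable pair, with the number in `value`.
my.melted.data <- melt(my.data, id.vars = "firm")

# summarise evaluates each expression inside the per-variable piece, so the
# statistics are computed on the numeric `value` column only.
my.improved.summary <- ddply(
  my.melted.data, .(variable), summarise,
  mean   = mean(value, na.rm = TRUE),
  median = median(value, na.rm = TRUE),
  sd     = sd(value, na.rm = TRUE),
  n      = length(value)
)
print(my.improved.summary)
```

Adding firm to the grouping, as in .(firm, variable), should give the same statistics per firm, which covers the slicing and dicing case.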
