Every time I get a new data set, the first thing I do is check out the summary statistics. The summary function does a pretty good job, but I'm frequently interested in standard deviations, quantiles with different breakpoints, number of observations, etc. Also, the presentation of summary isn't the easiest to digest, nor is it what you see in journals (i.e., summary puts the variables across the columns instead of down the rows).
For example, here is what I get from summary with some made-up data.
> library(plyr)
> library(reshape2)
> my.data <- data.frame(firm = factor(rep(letters[1:5], each = 5)), returns = rnorm(n = 5 * 5), leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + .... [TRUNCATED]
> my.summary <- summary(my.data)
> my.summary
 firm      returns           leverage
 a:5   Min.   :-1.6765   Min.   :0.2863
 b:5   1st Qu.:-0.6945   1st Qu.:0.3929
 c:5   Median :-0.1930   Median :0.5061
 d:5   Mean   :-0.1159   Mean   :0.5009
 e:5   3rd Qu.: 0.4323   3rd Qu.:0.6011
       Max.   : 1.1915   Max.   :0.7093
But let's say I really want something more like this.
> my.manual.summary <- data.frame(mean = c(mean(my.data$returns), mean(my.data$leverage)), median = c(median(my.data$returns), median(my.data$leverage .... [TRUNCATED]
> rownames(my.manual.summary) <- c("returns", "leverage")
> my.manual.summary
               mean     median        sd
returns  -0.1158633 -0.1929571 0.6996548
leverage  0.5008895  0.5061301 0.1453381
For this small data set (i.e., just a few firm characteristics) this is easy. But if I have more variables, or want to compute more statistics or do more slicing and dicing, it can get tedious.
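One way to take the tedium out of the manual table is to loop over the numeric columns in base R. The sketch below is only an illustration: the my.data definition in the post is truncated, so the noise term on leverage is a stand-in, and summarize.numeric is a hypothetical helper name, not something from the original code.

```r
# Stand-in data: the original my.data definition is truncated, so the noise
# term added to leverage here is an assumption, not the author's code.
set.seed(1)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(25, sd = 0.01)
)

# Hypothetical helper: one row per numeric variable, one column per statistic.
summarize.numeric <- function(df, probs = c(0.1, 0.5, 0.9)) {
  # Keep only the numeric columns (drops the firm factor).
  numeric.df <- df[, sapply(df, is.numeric), drop = FALSE]
  stats <- sapply(numeric.df, function(x) {
    c(n    = sum(!is.na(x)),
      mean = mean(x, na.rm = TRUE),
      sd   = sd(x, na.rm = TRUE),
      quantile(x, probs = probs, na.rm = TRUE))
  })
  t(stats)  # transpose so that variables become rows
}

my.table <- summarize.numeric(my.data)
print(round(my.table, 4))
```

Changing probs gives quantiles with different breakpoints, and further statistics can be added to the inner c() call without touching the rest.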
I tried this with reshape2 and plyr, but I get an error.
> my.melted.data <- melt(my.data)
Using firm as id variables
> my.improved.summary <- ddply(my.melted.data[, -1], .(variable), c("mean", "median", "sd"), na.rm = T)
Error in proto[[i]] <- fs[[i]](x, ...) :
more elements supplied than there are to replace
In addition: Warning messages:
1: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
3: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
4: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
This leaves me with two questions:
... ddply ...?
Thanks!
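A likely reading of the error above is that the character vector of function names gets applied to each whole piece of the melted data frame (which still contains the non-numeric variable column) rather than to the value column. A minimal sketch of one possible fix is to use plyr's summarise and name the column explicitly. The data below is a stand-in, since the original my.data definition is truncated.

```r
library(plyr)
library(reshape2)

# Stand-in data: the noise term on leverage is an assumption, because the
# original definition of my.data is truncated in the post.
set.seed(1)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(25, sd = 0.01)
)

# Long format: one row per (firm, variable) observation, values in `value`.
my.melted.data <- melt(my.data, id.vars = "firm")

# Apply each statistic to the `value` column within each variable.
my.improved.summary <- ddply(
  my.melted.data, .(variable), summarise,
  mean   = mean(value, na.rm = TRUE),
  median = median(value, na.rm = TRUE),
  sd     = sd(value, na.rm = TRUE)
)
print(my.improved.summary)
```

Replacing .(variable) with .(firm, variable) should give the same statistics sliced by firm as well.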
One easy way to create summary tables in R is to use the describe() and describeBy() functions from the psych package.
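As a rough illustration of those two functions, the sketch below assumes the psych package is installed and uses stand-in data (the original my.data definition is truncated, so the noise term on leverage is an assumption).

```r
library(psych)

# Stand-in data: an assumption, not the author's exact data.
set.seed(1)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(25, sd = 0.01)
)

numeric.data <- my.data[, c("returns", "leverage")]

# One row per variable; keep only the statistics of interest.
overall <- describe(numeric.data)
print(overall[, c("n", "mean", "sd", "median", "min", "max")])

# The same table computed separately for each firm.
by.firm <- describeBy(numeric.data, group = my.data$firm)
print(by.firm)
```

The describe() result already has one row per variable, which is the vertical layout asked for in the question.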
Descriptive statistics in R (Method 1): summary statistics can be computed with the summary() function. When given a data frame, summary() is applied to each column automatically, and the format of the result depends on the column's data type. For a numeric column, it returns the minimum, first quartile, median, mean, third quartile, and maximum.
Try the stat.desc function in the pastecs package. You can use it on your data set by calling stat.desc(my.data). To get the output in the format you desire, you need to (a) transpose the data frame, (b) remove non-numeric variables, and (c) retain only the summary statistics columns you require.
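The three steps can be sketched as follows, assuming the pastecs package is installed. The data is a stand-in because the original my.data definition is truncated, and the non-numeric column is dropped before transposing, which is just one convenient order for steps (a) and (b).

```r
library(pastecs)

# Stand-in data: the noise term on leverage is an assumption.
set.seed(1)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(25, sd = 0.01)
)

# (b) Remove non-numeric variables (here, the firm factor).
numeric.cols <- sapply(my.data, is.numeric)
full.stats <- stat.desc(my.data[, numeric.cols, drop = FALSE])

# (a) Transpose so variables are rows, then
# (c) keep only the statistics needed for the table.
my.table <- t(full.stats)[, c("mean", "median", "std.dev"), drop = FALSE]
print(round(my.table, 4))
```

Other rows of the stat.desc output, such as nbr.val for the number of observations, can be added to the column selection in the same way.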