Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a summary statistical table from a data frame

Tags:

I have the following data frame (df) of 29 observations of 5 variables:

    age   height_seca1 height_chad1 height_DL weight_alog1 1   19         1800         1797       180           70 2   19         1682         1670       167           69 3   21         1765         1765       178           80 4   21         1829         1833       181           74 5   21         1706         1705       170          103 6   18         1607         1606       160           76 7   19         1578         1576       156           50 8   19         1577         1575       156           61 9   21         1666         1665       166           52 10  17         1710         1716       172           65 11  28         1616         1619       161           66 12  22         1648         1644       165           58 13  19         1569         1570       155           55 14  19         1779         1777       177           55 15  18         1773         1772       179           70 16  18         1816         1809       181           81 17  19         1766         1765       178           77 18  19         1745         1741       174           76 19  18         1716         1714       170           71 20  21         1785         1783       179           64 21  19         1850         1854       185           71 22  31         1875         1880       188           95 23  26         1877         1877       186          106 24  19         1836         1837       185          100 25  18         1825         1823       182           85 26  19         1755         1754       174           79 27  26         1658         1658       165           69 28  20         1816         1818       183           84 29  18         1755         1755       175           67 

I wish to obtain the mean, standard deviation, median, minimum, maximum and sample size of each of the variables and get an output as a data frame. I tried using the code below but then the it becomes impossible for me to work with and using tapply or aggregate seems to be beyond me as a novice R programmer. My assignment requires me not use any 'extra' R packages.

apply(df, 2, mean) apply(df, 2, sd) apply(df, 2, median) apply(df, 2, min) apply(df, 2, max) apply(df, 2, length) 

Ideally, this is how the output data frame should look like including the row headings for each of the statistical functions:

             age height_seca1 height_chad1    height_DL weight_alog1  mean          20         1737         1736          173           73 sd            3.3         91.9         92.7          9.7         14.5  median        19         1755         1755          175           71 minimum       17         1569         1570          155           50  maximum       31         1877         1880          188          106 sample size   29           29           29           29           29  

Any help would be greatly appreciated.

like image 356
pkg77x7 Avatar asked Jan 08 '14 13:01

pkg77x7


People also ask

How do you get the statistical summary of a data frame?

Summarizing DataThe describe() function computes a summary of statistics pertaining to the DataFrame columns. This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns.

What is a summary statistic table?

The summary table is a visualization that summarizes statistical information about data in table form. The information is based on one data table in TIBCO Spotfire. You can, at any time, choose which measures you want to see (such as mean, median, etc.), as well as the columns on which to base these measures.

How do you make a summary table in R?

The easiest way to create summary tables in R is to use the describe() and describeBy() functions from the psych library.


2 Answers

Try with basicStats from fBasics package

> install.packages("fBasics") > library(fBasics) > basicStats(df)                    age height_seca1 height_chad1   height_DL weight_alog1 nobs         29.000000    29.000000    29.000000   29.000000    29.000000 NAs           0.000000     0.000000     0.000000    0.000000     0.000000 Minimum      17.000000  1569.000000  1570.000000  155.000000    50.000000 Maximum      31.000000  1877.000000  1880.000000  188.000000   106.000000 1. Quartile  19.000000  1666.000000  1665.000000  166.000000    65.000000 3. Quartile  21.000000  1816.000000  1809.000000  181.000000    80.000000 Mean         20.413793  1737.241379  1736.482759  173.379310    73.413793 Median       19.000000  1755.000000  1755.000000  175.000000    71.000000 Sum         592.000000 50380.000000 50358.000000 5028.000000  2129.000000 SE Mean       0.612910    17.069018    17.210707    1.798613     2.700354 LCL Mean     19.158305  1702.277081  1701.228224  169.695018    67.882368 UCL Mean     21.669282  1772.205677  1771.737293  177.063602    78.945219 Variance     10.894089  8449.189655  8590.044335   93.815271   211.465517 Stdev         3.300619    91.919474    92.682492    9.685828    14.541854 Skewness      1.746597    -0.355499    -0.322915   -0.430019     0.560360 Kurtosis      2.290686    -1.077820    -1.086108   -1.040182    -0.311017 

You can also subset the output to get what you want:

> basicStats(df)[c("Mean", "Stdev", "Median", "Minimum", "Maximum", "nobs"),]               age height_seca1 height_chad1  height_DL weight_alog1 Mean    20.413793   1737.24138   1736.48276 173.379310     73.41379 Stdev    3.300619     91.91947     92.68249   9.685828     14.54185 Median  19.000000   1755.00000   1755.00000 175.000000     71.00000 Minimum 17.000000   1569.00000   1570.00000 155.000000     50.00000 Maximum 31.000000   1877.00000   1880.00000 188.000000    106.00000 nobs    29.000000     29.00000     29.00000  29.000000     29.00000 

Another alternative is that you define your own function as in this post.

Update:

(I hadn't read the "My assignment requires me not use any 'extra' R packages." part)

As I said before, you can define your own function and loop over each column by using *apply family functions:

my.summary <- function(x,...){   c(mean=mean(x, ...),     sd=sd(x, ...),     median=median(x, ...),     min=min(x, ...),     max=max(x,...),      n=length(x)) }  # all these calls should give you the same results. apply(df, 2, my.summary) sapply(df, my.summary) do.call(cbind,lapply(df, my.summary)) 
like image 80
Jilber Urbina Avatar answered Sep 21 '22 05:09

Jilber Urbina


Or using what you have already done, you just need to put those summaries into a list and use do.call

df <- structure(list(age = c(19L, 19L, 21L, 21L, 21L, 18L, 19L, 19L, 21L, 17L, 28L, 22L, 19L, 19L, 18L, 18L, 19L, 19L, 18L, 21L, 19L, 31L, 26L, 19L, 18L, 19L, 26L, 20L, 18L), height_seca1 = c(1800L, 1682L, 1765L, 1829L, 1706L, 1607L, 1578L, 1577L, 1666L, 1710L, 1616L, 1648L, 1569L, 1779L, 1773L, 1816L, 1766L, 1745L, 1716L, 1785L, 1850L, 1875L, 1877L, 1836L, 1825L, 1755L, 1658L, 1816L, 1755L), height_chad1 = c(1797L, 1670L, 1765L, 1833L, 1705L, 1606L, 1576L, 1575L, 1665L, 1716L, 1619L, 1644L, 1570L, 1777L, 1772L, 1809L, 1765L, 1741L, 1714L, 1783L, 1854L, 1880L, 1877L, 1837L, 1823L, 1754L, 1658L, 1818L, 1755L), height_DL = c(180L, 167L, 178L, 181L, 170L, 160L, 156L, 156L, 166L, 172L, 161L, 165L, 155L, 177L, 179L, 181L, 178L, 174L, 170L, 179L, 185L, 188L, 186L, 185L, 182L, 174L, 165L, 183L, 175L), weight_alog1 = c(70L, 69L, 80L, 74L, 103L, 76L, 50L, 61L, 52L, 65L, 66L, 58L, 55L, 55L, 70L, 81L, 77L, 76L, 71L, 64L, 71L, 95L, 106L, 100L, 85L, 79L, 69L, 84L, 67L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29"))  tmp <- do.call(data.frame,             list(mean = apply(df, 2, mean),                 sd = apply(df, 2, sd),                 median = apply(df, 2, median),                 min = apply(df, 2, min),                 max = apply(df, 2, max),                 n = apply(df, 2, length))) tmp                     mean        sd median  min  max  n age            20.41379  3.300619     19   17   31 29 height_seca1 1737.24138 91.919474   1755 1569 1877 29 height_chad1 1736.48276 92.682492   1755 1570 1880 29 height_DL     173.37931  9.685828    175  155  188 29 weight_alog1   73.41379 14.541854     71   50  106 29 

or...

data.frame(t(tmp))               age height_seca1 height_chad1  height_DL weight_alog1 mean   20.413793   1737.24138   1736.48276 173.379310     73.41379 sd      3.300619     91.91947     92.68249   9.685828     14.54185 median 19.000000   1755.00000   1755.00000 175.000000     71.00000 min    17.000000   1569.00000   1570.00000 155.000000     50.00000 max    31.000000   1877.00000   1880.00000 188.000000    106.00000 n      29.000000     29.00000     29.00000  29.000000     29.00000 
like image 39
rawr Avatar answered Sep 21 '22 05:09

rawr