I am thinking of writing a data dictionary function in R which, taking a data frame as an argument, will do the following:
1) Create a text file which:
a. Summarises the data frame by listing the number of variables by class, number of observations, number of complete observations … etc
b. For each variable, summarise the key facts about that variable: mean, min, max, mode, number of missing observations … etc
2) Creates a pdf containing a histogram for each numeric or integer variable and a bar chart for each attribute variable.
The basic idea is to create a data dictionary of a data frame with one function.
My question is: is there a package which already does this? And if not, do people think this would be a useful function? Thanks
There are a variety of describe
functions in various packages. The one I am most familiar with is Hmisc::describe. Here's its description from its help page:
" This function determines whether the variable is character, factor, category, binary, discrete numeric, and continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed."
And an example of the output:
Hmisc::describe(work2[, c("CHOLEST","HDL")])
work2[, c("CHOLEST", "HDL")]
2 Variables 5325006 Observations
----------------------------------------------------------------------------------
CHOLEST
n missing unique Mean .05 .10 .25 .50 .75 .90
4410307 914699 689 199.4 141 152 172 196 223 250
.95
268
lowest : 0 10 19 20 31, highest: 1102 1204 1213 1219 1234
----------------------------------------------------------------------------------
HDL
n missing unique Mean .05 .10 .25 .50 .75 .90
4410298 914708 258 54.2 32 36 43 52 63 75
.95
83
lowest : -11.0 0.0 0.2 1.0 2.0, highest: 241.0 243.0 248.0 272.0 275.0
----------------------------------------------------------------------------------
Furthermore, on your point about getting histograms, the Hmisc::latex method for a describe-object will produce histograms interleaved in the output illustrated above. (You do need to have a function LaTeX installation to take advantage of this.) I'm pretty sure you can find an illustration of the output in either Harrell's website or with the Amazon "Look Inside" presentation of his book "Regression Modeling Strategies". The book has a ton of useful material regarding data analysis.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With