Data dictionary packages in R

Tags:

r

I am thinking of writing a data dictionary function in R which, taking a data frame as an argument, will do the following:

1) Create a text file which:

a. Summarises the data frame by listing the number of variables by class, the number of observations, the number of complete observations, etc.

b. For each variable, summarises the key facts about that variable: mean, min, max, mode, number of missing observations, etc.

2) Create a PDF containing a histogram for each numeric or integer variable and a bar chart for each attribute variable.

The basic idea is to create a data dictionary of a data frame with one function.

My question is: is there a package which already does this? And if not, do people think this would be a useful function? Thanks

Ross Farrelly asked Oct 08 '11 08:10


1 Answer

There are a variety of describe functions in various packages. The one I am most familiar with is Hmisc::describe. Here's its description from its help page:

"This function determines whether the variable is character, factor, category, binary, discrete numeric, and continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed."

And an example of the output:

Hmisc::describe(work2[, c("CHOLEST","HDL")])
work2[, c("CHOLEST", "HDL")] 

 2  Variables      5325006  Observations
----------------------------------------------------------------------------------
CHOLEST 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
4410307  914699     689   199.4     141     152     172     196     223     250 
    .95 
    268 

lowest :    0   10   19   20   31, highest: 1102 1204 1213 1219 1234 
----------------------------------------------------------------------------------
HDL 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
4410298  914708     258    54.2      32      36      43      52      63      75 
    .95 
     83 

lowest : -11.0   0.0   0.2   1.0   2.0, highest: 241.0 243.0 248.0 272.0 275.0 
---------------------------------------------------------------------------------- 
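The work2 data frame in the output above is not available, so here is a small reproducible sketch of the same call. It assumes the Hmisc package is installed and uses the built-in mtcars data as a stand-in; the column choice is only for illustration.

```r
# Assumes Hmisc is installed, e.g. via install.packages("Hmisc")
library(Hmisc)

# mtcars ships with R; it stands in for `work2`, which is not available here
d <- describe(mtcars[, c("mpg", "cyl")])

# cyl has only 3 distinct values, so by the rule quoted from the help page
# it is treated as discrete; mpg has more than 20 distinct values
print(d)
```

Printing the object gives one block per variable, in the same layout as the CHOLEST and HDL blocks shown above.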

Furthermore, on your point about getting histograms, the Hmisc::latex method for a describe object will produce histograms interleaved in the output illustrated above. (You do need a functioning LaTeX installation to take advantage of this.) I'm pretty sure you can find an illustration of the output either on Harrell's website or in the Amazon "Look Inside" presentation of his book "Regression Modeling Strategies". The book has a ton of useful material regarding data analysis.
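If none of the existing describe functions fits, the one-function idea from the question can be roughed out in base R. This is only a minimal sketch, not an existing package function: the name data_dictionary and its arguments are hypothetical, and it assumes every column has at least one non-missing value.

```r
# Hypothetical helper: the function name and arguments are illustrative only.
# Assumes every column of `df` has at least one non-missing value.
data_dictionary <- function(df, txt_file = "dictionary.txt",
                            pdf_file = "dictionary.pdf") {
  stopifnot(is.data.frame(df))
  classes <- vapply(df, function(x) class(x)[1], character(1))
  class_counts <- table(classes)

  # Part 1a: overall summary of the data frame
  lines <- c(
    sprintf("Observations: %d", nrow(df)),
    sprintf("Complete observations: %d", sum(complete.cases(df))),
    "Variables by class:",
    sprintf("  %s: %d", names(class_counts), as.integer(class_counts))
  )

  # Part 1b: key facts for each variable
  for (v in names(df)) {
    x <- df[[v]]
    lines <- c(lines, "",
               sprintf("%s (%s), missing: %d", v, classes[[v]], sum(is.na(x))))
    if (is.numeric(x)) {
      lines <- c(lines, sprintf("  mean %.3f, min %.3f, max %.3f",
                                mean(x, na.rm = TRUE),
                                min(x, na.rm = TRUE),
                                max(x, na.rm = TRUE)))
    } else {
      # Most frequent level as the mode of a non-numeric variable
      tab <- sort(table(x), decreasing = TRUE)
      lines <- c(lines, sprintf("  mode: %s (%d)",
                                names(tab)[1], as.integer(tab[1])))
    }
  }
  writeLines(lines, txt_file)

  # Part 2: one plot per variable, each on its own page of the PDF
  pdf(pdf_file)
  on.exit(dev.off(), add = TRUE)
  for (v in names(df)) {
    x <- df[[v]]
    if (is.numeric(x)) hist(x, main = v, xlab = v) else barplot(table(x), main = v)
  }
  invisible(lines)
}

# Usage on a built-in data set, writing to temporary files
txt_out <- tempfile(fileext = ".txt")
pdf_out <- tempfile(fileext = ".pdf")
dict <- data_dictionary(iris, txt_out, pdf_out)
```

The text file gets the overall and per-variable summaries, and the PDF gets a histogram for each numeric column and a bar chart for everything else. Quantiles, the discrete/continuous distinction and nicer formatting are left out; Hmisc::describe already covers those.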

IRTFM answered Sep 29 '22 22:09