
Workflow for statistical analysis and report writing



I generally break my projects into 4 pieces:

  1. load.R
  2. clean.R
  3. func.R
  4. do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I'll either write out the workspace using save() or just keep things in memory for the next step.
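As an illustration only (the file names and data sources below are made up), load.R often amounts to little more than:

# load.R - read the raw inputs; everything downstream starts from these objects
raw.survey <- read.csv("data/survey.csv", stringsAsFactors = FALSE)
raw.prices <- read.csv("http://example.com/prices.csv")
# optionally cache the workspace so later steps can start from here
save(raw.survey, raw.prices, file = "cache/raw.RData")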

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
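A sketch of the kind of thing that ends up in clean.R (the column names are hypothetical, continuing the made-up load.R above):

# clean.R - assumes the objects created by load.R are in the workspace
raw.survey$income[raw.survey$income < 0] <- NA          # recode impossible values
survey <- merge(raw.survey, raw.prices, by = "region")  # combine the sources
survey <- survey[!is.na(survey$income), ]               # drop incomplete rows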

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
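For example (again hypothetical), func.R would contain nothing but definitions:

# func.R - definitions only; source()'ing this must not touch the data
fit.model <- function(dat) lm(income ~ age + region, data = dat)

residual.plot <- function(fit)
  plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residuals")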

do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
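So do.R just strings the pieces together, along the lines of:

# do.R - run the analysis on the cleaned data and write out the results
source("func.R")
fit <- fit.model(survey)
print(summary(fit))
pdf("output/residuals.pdf")
residual.plot(fit)
dev.off()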

The main motivation for this setup is working with large data, where you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.


If you'd like to see some examples, I have a few small (and not so small) data cleaning and analysis projects available online. In most, you'll find a script to download the data, one to clean it up, and a few to do exploration and analysis:

  • Baby names from the Social Security Administration
  • 30+ years of fuel economy data from the EPA
  • A big collection of data about the housing crisis
  • Movie ratings from the IMDB
  • House sale data in the Bay Area

Recently I have started numbering the scripts, so it's completely obvious in which order they should be run. (If I'm feeling really fancy, I'll sometimes make it so that the exploration script calls the cleaning script, which in turn calls the download script, each doing the minimal work necessary - usually by checking for the presence of output files with file.exists. Most of the time, though, this seems like overkill.)
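The guard itself is just a line or two at the top of each script, something like this (script names invented):

# at the top of 1-clean.r: only redo the download if its output is missing
if (!file.exists("data/raw.csv")) source("0-download.r")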

I use git for all my projects (a source code management system), so it's easy to collaborate with others, see what has changed, and roll back to previous versions.

If I do a formal report, I usually keep the R and the LaTeX separate, but I always make sure that I can source my R code to produce all the code and output that I need for the report. For the sorts of reports that I do, I find this easier and cleaner than embedding the R code in the LaTeX itself.
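A minimal sketch of that separation (file names invented, reusing the objects from the sketches above): one sourceable script writes every figure and table to disk, and the LaTeX document only \includegraphics{} or \input{}s the results.

# report-output.R - source()'ing this regenerates everything the .tex file includes
library(xtable)
pdf("figures/income-by-age.pdf", width = 6, height = 4)
plot(survey$age, survey$income)
dev.off()
print(xtable(fit), file = "tables/model-coefs.tex")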


I agree with the other responders: Sweave is excellent for report writing with R. And rebuilding the report with updated results is as simple as re-calling the Sweave function. It's completely self-contained, including all the analysis, data, etc. And you can version control the whole file.

I use the StatET plugin for Eclipse for developing the reports, and Sweave is integrated (Eclipse recognizes the LaTeX formatting, etc.). On Windows, it's easy to use MiKTeX.

I would also add that you can create beautiful reports with Beamer. Creating a normal report is just as simple. I've included an example below that pulls data from Yahoo! and creates a chart and a table (using quantmod). You can build this report like so:

Sweave(file = "test.Rnw")
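Sweave itself only writes the intermediate test.tex; running that file through pdflatex, or through texi2dvi as in the brew example at the end of this thread, produces the final PDF:

tools::texi2dvi("test.tex", pdf = TRUE)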

Here's the Beamer document itself:

% 
\documentclass[compress]{beamer}
\usepackage{Sweave}
\usetheme{PaloAlto} 
\begin{document}

\title{test report}
\author{john doe}
\date{September 3, 2009} 

\maketitle

\begin{frame}[fragile]\frametitle{Page 1: chart}

<<echo=FALSE,fig=TRUE,height=4, width=7>>=
library(quantmod)
getSymbols("PFE", from="2009-06-01")
chartSeries(PFE)
@

\end{frame}


\begin{frame}[fragile]\frametitle{Page 2: table}

<<echo=FALSE,results=tex>>=
library(xtable)
xtable(PFE[1:10,1:4], caption = "PFE")
@

\end{frame}

\end{document}

I just wanted to add, in case anyone missed it, that there's a great post on the learnr blog about creating repetitive reports with Jeffrey Horner's brew package. Matt and Kevin both mentioned brew above. I haven't actually used it much myself.

The post follows a nice workflow, so it's well worth a read:

  1. Prepare the data.
  2. Prepare the report template.
  3. Produce the report.

Actually producing the report once the first two steps are complete is very simple:

library(tools)
library(brew)
brew("population.brew", "population.tex")   # fill in the template
texi2dvi("population.tex", pdf = TRUE)      # compile the result to a PDF