
Avoid loading data every time in knitr

I am creating a document using knitr and I am finding it tedious to reload the data from disk every time I parse the document while I'm in development. I've subsetted that datafile for development to shorten the load time. I also have knitr cache set to on.

I tried assigning the data to the global environment using <<-, and using exists with where=globalenv(), but that did not work.

Anyone know how to use preloaded data from the environment in knitr or have other ideas to speed up development?

Daniel asked Sep 21 '14 19:09



1 Answer

When a document is knitted with RStudio's "Knit" button, it is rendered in a fresh R session, so nothing in your global environment is passed to the document. This is intentional: accidentally referencing an object in the global environment is an easy way to break a reproducible analysis, and starting from a clean session each time means the RMarkdown file runs on its own, regardless of what is in the global environment.

If you do have a use case which justifies preloading the data, there are a few things you can do.

Example Data

Firstly I have created a minimal Rmd file as below called "RenderTest.Rmd":

---
title: "Render"
author: "Michael Harper"
date: "7 November 2017"
output: pdf_document
---

```{r cars}
summary(cars2)
```

In this example, cars2 is a dataset that I am referencing from my global session. Run on its own using the "Knit" command in RStudio, the document returns the following error:

Error in summary(cars2): object 'cars2' not found: ... withCallingHandlers -> withVisible -> eval -> eval -> summary Execution halted

Option 1: Manually Call the render function

The render function from rmarkdown can be called from another R script or from the console. By default it evaluates the document in the calling environment (its envir argument defaults to parent.frame()) rather than in a fresh one, so the document can use any objects you have already loaded. As an example:

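# Build script: run this from an interactive R session
library(rmarkdown)

# cars2 is created here, in the environment that render() is called from
cars2 <- cars
render("RenderTest.Rmd")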

I would, however, be careful doing this. The benefit of using RMarkdown is that it makes the script incredibly easy to reproduce. As soon as you start using external scripts, the analysis becomes harder to replicate, because not everything it depends on is contained within the file.
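
To tie Option 1 back to the original question, a build script could load the data only when it is missing from the global environment, and then render the document there. This is a minimal sketch rather than a definitive setup: `cars` stands in for a slow read from disk, and `envir` is passed explicitly only to make the default behaviour visible. The guard around `render()` is an addition so that the script does nothing if rmarkdown or the Rmd file is absent.

```r
# Load the data once per interactive session; later runs reuse the object.
if (!exists("cars2", envir = globalenv())) {
  # Stand-in for a slow load, e.g. reading a large file from disk
  assign("cars2", cars, envir = globalenv())
}

# Render in the global environment so the document can see cars2.
# Only runs if rmarkdown is installed and the example file is present.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    file.exists("RenderTest.Rmd")) {
  rmarkdown::render("RenderTest.Rmd", envir = globalenv())
}
```

Sourcing this script repeatedly in the same session skips the load step after the first run. The reproducibility caveat above still applies, since the document now depends on state held outside the file.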

Option 2: Save data to an R object

If you have some analysis which takes time to run, you can save the result of the analysis as an R object, and then you can reload the final version of the data into the session. Using my above example:

```{r dataProcess, cache = TRUE}
cars2 <- cars
save(cars2, file = "carsData.RData") # saves the 'cars2' dataset; 'file' must be named
```
and then we can just reload the data into the session:

```{r}
load("carsData.RData") # reloads the 'cars2' dataset
```

I prefer this technique. The dataProcess chunk is cached, so it is only rerun when its code changes. The result is saved to a file, which the next chunk then loads. The data still has to be loaded into the session, but you can save the finalised dataset after any data cleaning, so the cleaning itself is not repeated.
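
A variation on the same idea, sketched here with `saveRDS()`/`readRDS()` instead of `save()`/`load()`, is to build the dataset only when the saved file does not exist yet. The helper name `load_or_build` and the cache location are hypothetical, and `cars` again stands in for a slow processing step.

```r
# Hypothetical helper: return the saved object if the file exists,
# otherwise build it, save it for next time, and return it.
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build()
  saveRDS(obj, path)
  obj
}

# Assumed cache location; in a real project this would sit next to the Rmd file.
cache_file <- file.path(tempdir(), "carsData.rds")

cars2 <- load_or_build(cache_file, function() {
  # Stand-in for slow loading and cleaning
  cars
})
```

Unlike `load()`, `readRDS()` returns the object instead of restoring it under its saved name, so the chunk decides what the variable is called. Deleting the file forces a rebuild on the next run.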

Option 3: Build the file less frequently

With the updates made to RStudio over the past few years, there is less need to rebuild the file continuously. Chunks can be run directly within the file and their output viewed in the output window. This may spare you a lot of time spent optimising the script just to save a couple of minutes on compiling (which normally makes a good time to get a hot drink anyway!).

Michael Harper answered Oct 21 '22 08:10