
Load dataset from "R" package using data(), assign it directly to a variable?

Tags: dataframe, r

How do you load a dataset from an R package using the data() function, and assign it directly to a variable without creating a duplicate copy in your environment?

Put simply, can you do this without creating two identical dfs in your environment:

> data("faithful") # Old Faithful Geyser Data from datasets package

> x <- faithful 

> ls() # Now I have 2 identical dfs - x and faithful - in my environment
[1] "faithful" "x" 

> remove(faithful) # Now I've removed one of the redundant dfs

Try 1:

My first approach was to assign data("faithful") directly to x. But data() returns the name of the dataset as a character vector, not the data itself. So now I have the df faithful and the character vector x in my environment.

> x <- data("faithful")
> x
[1] "faithful" # String, not the df "faithful" from the datasets package

> ls()
[1] "faithful" "x"  

Try 2:

I tried to be a little more sophisticated in my second attempt.

> x <- get(data("faithful")) # This works as far as assignment goes

> ls() # However I still get the duplicate copy
[1] "faithful" "x"

A short note about my motivation for trying to do this. I have an R package with 5 very large data.frames - each having the same columns. I want to efficiently generate the same calculated columns on all 5 data.frames. So I want to use data() within a list() constructor to get the 5 data.frames into a list. Then I want to use llply() and mutate() from the plyr package to iterate over the dfs in the list and create the calculated columns for each df. But I don't want to have duplicate copies of the 5 large datasets sitting in my environment as this is within a Shiny App with a RAM limit.


edit: I was able to use both of @henfiber's methods from his answer to figure out how to lazy-load entire data.frames into a named list.

The first command here works for assigning a data.frame to a new variable name.

# this loads faithful into a variable x. 
# Note we don't need to use the data() function to load faithful
> delayedAssign("x",faithful) 

But I wanted to create a named list x with elements y = data(faithful), z=data(iris), etc.

I tried the below and it didn't work.

> x <- list(delayedAssign("y",faithful),delayedAssign("z", iris))
> ls()
[1] "x" "y" "z" # x is a list with 2 nulls, y & z are promises to faithful & iris
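
A note on why the attempt above yields NULLs: delayedAssign() is called for its side effect and returns NULL invisibly, and list() evaluates its arguments when the list is built. A minimal sketch of one alternative, assuming an environment is acceptable as the container instead of a list (the name `lazy` is made up for illustration):

```r
# An environment used as the container keeps the promises unevaluated
# until an element is first accessed.
lazy <- new.env()
delayedAssign("y", datasets::faithful, assign.env = lazy)
delayedAssign("z", datasets::iris, assign.env = lazy)

ls(lazy)              # "y" "z"
y_dims <- dim(lazy$y) # touching y forces only that promise
```

Elements are then read with `lazy$y` or `lazy[["z"]]`, much like a named list.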

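But I was finally able to construct a named list of data.frame objects, with no stray copies in my global environment, in the following manner (note that getdata() returns the evaluated data.frame, so the list elements are loaded data rather than promises):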

# define this function provided by henfiber
getdata <- function(...)
{
    e <- new.env()
    name <- data(..., envir = e)[1]
    e[[name]]
}

# now create your list, this gives you one object "x" of class list
# with elements "y" and "z" which are your data.frames
x <- list(y=getdata(faithful),z=getdata(iris))
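
For the motivating use case (the same calculated column on every data.frame in the list), one possible sketch follows. It uses base R lapply() and transform() in place of plyr's llply() and mutate(), and it uses beaver1 and beaver2 from the datasets package as stand-ins because they share the same columns. The column name `temp_f` is hypothetical.

```r
# Stand-ins for the large data.frames: two datasets with identical columns.
dfs <- list(b1 = datasets::beaver1, b2 = datasets::beaver2)

# Add the same calculated column to every element; names are preserved.
dfs <- lapply(dfs, function(df) {
  transform(df, temp_f = temp * 9 / 5 + 32)  # Celsius to Fahrenheit
})

names(dfs)  # "b1" "b2"
```

Overwriting `dfs` with the result avoids keeping both the original and the augmented list bound to names at the same time.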
asked Jun 20 '15 at 06:06 by aashanand




1 Answer

Using a helper function:

# define this function
getdata <- function(...)
{
    e <- new.env()
    name <- data(..., envir = e)[1]
    e[[name]]
}

# now load your data calling getdata()
x <- getdata("faithful")
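
When several datasets are needed at once, the same idea can be extended to return a named list. This is only a sketch: `getdata_list` is a hypothetical name, and it assumes each dataset name matches the name of the object that data() creates, which does not hold for every dataset.

```r
# Load several datasets into a private environment and return them
# as a named list, leaving the caller's environment untouched.
getdata_list <- function(names, package = NULL) {
  e <- new.env()
  data(list = names, package = package, envir = e)
  mget(names, envir = e)  # errors if an expected object is missing
}

x <- getdata_list(c("faithful", "iris"), package = "datasets")
names(x)  # "faithful" "iris"
```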

Or using an anonymous function:

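# get() must look in the same environment that data() loaded the object into
x <- (function(...) { e <- new.env(); get(data(..., envir = e)[1], envir = e) })("faithful")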

Lazy evaluation

You should also consider lazy loading your data by adding LazyData: true to the DESCRIPTION file of your package.

If you use RStudio, after running data("faithful") you will see in the Environment panel that the "faithful" data.frame is labeled as a "promise" (a less common name is "thunk") and is greyed out. That means R evaluates it lazily and it has not yet been loaded into memory. You can even make the "x" variable lazy with the delayedAssign() function:

data("faithful")              # lazy load "faithful"
delayedAssign("x", faithful)  # lazy assign "x" with a reference to "faithful"
rm(faithful)                  # remove "faithful"

At this point nothing has been loaded into memory.

summary(x)                    # now x has been loaded and evaluated
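
To see when a promise is actually evaluated outside RStudio, a small sketch can record a side effect inside the delayed expression. The names `state`, `forced`, `before` and `after` are made up for illustration.

```r
state <- new.env()
state$forced <- FALSE

delayedAssign("x", {
  state$forced <- TRUE      # side effect marks the moment of evaluation
  datasets::faithful
}, assign.env = state)

before <- state$forced      # FALSE: the promise has not been evaluated yet
rows <- nrow(state$x)       # first access forces the promise
after <- state$forced       # TRUE
```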

Learn more about lazy evaluation here.

answered Oct 01 '22 at 13:10 by henfiber