
In-place list modification without for loop in R

Tags: r

I'm wondering whether there is a way to do in-place modification of objects in a list without using a for loop. This would be useful, for example, if the individual objects in the list are large and complex, so that we want to avoid making a temporary copy of the entire object. As an example, consider the following code, which creates a list of three data frames, then calculates the vector of maximums across all three data frames for one column of the data, and then assigns that vector to each original data frame. (Code like this is needed when aligning plots in ggplot2.)

data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))

max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))

for (i in seq_along(data_list)) {
  data_list[[i]]$x <- max_x
}

Is there any way to write the final part without a for loop?

Answers to some of the questions I'm getting:

  1. What makes me think a copy would be made? I don't know for sure whether a copy would or would not be made. The actual scenario I'm working with deals with entire ggplot graphs (see e.g. here). Since they are rather large and complex, it's critical that no copy be made.

  2. What's the problem with a for loop? I just would rather iterate directly over a list than have to introduce a counter. I don't like counters.

  3. Why not use data.table? Because I'm actually manipulating ggplot graphs, not data frames. The code provided here is just a simplified example.

Claus Wilke asked Mar 21 '16 17:03



1 Answer

Base R data structures are copy-on-modify with sharing. Take your example of a data.frame with three numeric columns. Each data.frame is a length-3 list whose elements are references to the numeric vectors holding the columns. If we modify or replace the first column, R creates a new length-3 data.frame list containing a reference to the new (or newly modified) column and references to the other two, unmodified columns.

Let's take a look using the address function*:

library(data.table)  # provides address()

set.seed(1)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))

before <- rapply(data_list, address)

Now you want to replace the first column with

max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))

How you do this doesn't matter much, but here's one way without an explicit loop and counter:

data_list <- lapply(data_list, `[<-`, "x", value = max_x)

after <- rapply(data_list,address)
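
As an aside, the backtick call above may be easier to read when written as an anonymous function. The sketch below uses only base R and works on a fresh copy of the data so that it does not disturb `before` and `after`; the names `demo_list`, `demo_max`, `via_function`, `via_backtick`, and `via_loop` are illustrative and not part of the question. It only checks that all three spellings give the same result, not how memory is shared:

```r
set.seed(1)
demo_list <- lapply(1:3, function(i) data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10)))
demo_max <- do.call(pmax, lapply(demo_list, function(d) d$x))

# Spelling 1: anonymous function that replaces the column and returns the data frame
via_function <- lapply(demo_list, function(d) {
  d$x <- demo_max
  d
})

# Spelling 2: the replacement function `[<-` passed to lapply directly
via_backtick <- lapply(demo_list, `[<-`, "x", value = demo_max)

# Reference: the counter-based loop from the question
via_loop <- demo_list
for (i in seq_along(via_loop)) {
  via_loop[[i]]$x <- demo_max
}

# Compare both loop-free spellings against the loop
same_result <- isTRUE(all.equal(via_function, via_loop)) &&
  isTRUE(all.equal(via_backtick, via_loop))
print(same_result)
```

Either loop-free form iterates directly over the list, which is what the question asks for; the choice between them is a matter of taste.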

Now compare the addresses before and after. Note that the addresses for the y and z columns have not changed. Furthermore, all "after" x columns have the same address -- the address of max_x!

address(max_x)
[1] "05660600"

cbind(before,after)

  before     after     
x "0565F530" "05660600"
y "0565F400" "0565F400"
z "05660AC0" "05660AC0"
x "05660A28" "05660600"
y "05660990" "05660990"
z "05660860" "05660860"
x "056607C8" "05660600"
y "05660730" "05660730"
z "05660698" "05660698"

This means you don't have to worry as much as you might think about making a change to a large data structure. In general, only the modified piece and the skeleton of the data structure have to be replaced. In this example, the max_x vector had to be created anyway, so the only overhead is creating a new 3-cell data.frame "list" and populating it with 3 references**. This could, however, become inefficient if you are iteratively "banging on" changes or modifying subvectors rather than entire columns. Those are use cases for data.table, and they do not apply to this example.
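
For contrast, here is a minimal sketch of the kind of by-reference update that data.table offers for those subvector cases. It assumes data.table is installed; `dt`, `addr_before`, `addr_after`, and `in_place` are illustrative names that do not come from the question. `set()` overwrites a single element of a column, and the column's address is expected to stay the same because no new vector is allocated:

```r
library(data.table)  # assumed installed; supplies data.table(), set(), address()

set.seed(1)
dt <- data.table(x = rnorm(10), y = rnorm(10))

# Address of column x before the update
addr_before <- address(dt$x)

# Overwrite the first element of column x by reference
set(dt, i = 1L, j = "x", value = 0)

# Address of column x after the update
addr_after <- address(dt$x)

in_place <- identical(addr_before, addr_after)
print(in_place)
print(dt$x[1])
```

None of this is needed for the whole-column replacement in the question, where ordinary copy-on-modify already shares the untouched columns.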


* The address function used here is exported from the data.table package.

** And, of course, in this example, the 3-cell outer list containing the 3 data.frames themselves.

A. Webb answered Nov 03 '22 06:11
