I'm wondering whether there is a way to do in-place modification of objects in a list without using a for loop. This would be useful, for example, if the individual objects in the list are large and complex, so that we want to avoid making a temporary copy of the entire object. As an example, consider the following code, which creates a list of three data frames, then calculates the vector of maximums across all three data frames for one column of the data, and then assigns that vector to each original data frame. (Code like this is needed when aligning plots in ggplot2.)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
for (i in seq_along(data_list)) {
  data_list[[i]]$x <- max_x
}
Is there any way to write the final part without a for loop?
Answers to some of the questions I'm getting:
What makes me think a copy would be made? I don't know for sure whether a copy would or would not be made. The actual scenario I'm working with deals with entire ggplot graphs (see e.g. here). Since they are rather large and complex, it's critical that no copy be made.
What's the problem with a for loop? I would just rather iterate directly over a list than have to introduce a counter. I don't like counters.
Why not use data.table? Because I'm actually manipulating ggplot graphs, not data frames. The code provided here is just a simplified example.
Base R data structures are copy-on-modify with sharing. Take your example of a data.frame with three numeric columns. Each data.frame is a length-3 "list" vector whose elements are references to the numeric vectors of the underlying columns. If we modify or replace the first column, R creates a new length-3 data.frame "list" containing references to the new (or newly modified) column and the other two unmodified columns.
Let's take a look using the address function*:
set.seed(1)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
before <- rapply(data_list,address)
Now you want to replace the first column with
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
How you do this doesn't much matter, but here's one way without an explicit loop-with-counter
data_list <- lapply(data_list,`[<-`,"x",value=max_x)
after <- rapply(data_list,address)
Now compare the addresses before and after. Note that the addresses for the y and z columns have not changed. Furthermore, all "after" x columns have the same address -- the address of max_x!
address(max_x)
[1] "05660600"
cbind(before,after)
  before     after
x "0565F530" "05660600"
y "0565F400" "0565F400"
z "05660AC0" "05660AC0"
x "05660A28" "05660600"
y "05660990" "05660990"
z "05660860" "05660860"
x "056607C8" "05660600"
y "05660730" "05660730"
z "05660698" "05660698"
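If passing `[<-` straight to lapply reads as too terse, the same replacement can be sketched with a small named helper. This is only an illustration in base R: replace_x and new_list are hypothetical names introduced here, not part of the original code, and the data are regenerated so the block runs on its own.

```r
set.seed(1)
data_list <- lapply(1:3, function(i) data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10)))
max_x <- do.call(pmax, lapply(data_list, function(d) d$x))

# Hypothetical helper: return the data frame with its x column replaced.
# Only the data frame "skeleton" is rebuilt; y and z are not touched.
replace_x <- function(d, value) {
  d$x <- value
  d
}

# Iterate directly over the list, with no counter variable
new_list <- lapply(data_list, replace_x, value = max_x)

# Every data frame now carries the shared maximum vector in x
all(vapply(new_list, function(d) identical(d$x, max_x), logical(1)))
```

Calling the address function from the footnote on the y and z columns of new_list should show the same kind of sharing as in the table above.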
This means you don't have to worry as much as you might think about making a change to a large data structure. In general, only the modified piece and the skeleton of the data structure will have to be replaced. In this example, the max_x vector had to be created anyway, so the only overhead is creating a new 3-cell data.frame "list" and populating it with 3 references**. This, however, could start to become inefficient if you are iteratively "banging on" changes or working with subvectors rather than entire columns. These are use cases for data.table that are not applicable to this example.
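For the subvector case mentioned above, here is a minimal sketch of what data.table's by-reference assignment looks like. It assumes the data.table package is installed; the object names (dt, addr_before, addr_after) are made up for this illustration, and it applies to data.tables, not to the ggplot objects in the question.

```r
library(data.table)

set.seed(1)
dt <- data.table(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Address of the x column vector before the update
addr_before <- address(dt$x)

# Overwrite rows 1 to 3 of column x by reference; no new column is allocated
set(dt, i = 1:3, j = "x", value = 0)

# Address of the x column vector after the update
addr_after <- address(dt$x)

# Compare the two addresses
identical(addr_before, addr_after)
```

Because set() writes into the existing column, the address should stay the same, which is what makes repeated small updates cheap in this setting.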
* The address function used here is exported from the data.table package.
** And, of course, in this example, the 3-cell outer list containing the 3 data.frames themselves.