Let’s make a simple dataframe and give it an attribute “foo”:
orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
“foo” is there:
attributes(orig)
#> $names
#> [1] "x1" "x2"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1
#>
#> $foo
#> [1] TRUE
But if I reorder the columns, “foo” disappears:
new <- orig[, c(2, 1)]
attributes(new)
#> $names
#> [1] "x2" "x1"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1
I could add it back with:
attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
attributes(new)
#> $names
#> [1] "x2" "x1"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1
#>
#> $foo
#> [1] TRUE
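For reference, here is a minimal sketch of what utils::modifyList() does in that call, using two plain lists as stand-ins for the attribute lists (the names from_orig and from_new are made up for this illustration):

```r
# Hypothetical stand-ins for attributes(orig) and attributes(new)
from_orig <- list(names = c("x1", "x2"), foo = TRUE)
from_new  <- list(names = c("x2", "x1"))

# Elements of the second list replace same-named elements of the first;
# elements that exist only in the first list (here "foo") are kept
merged <- utils::modifyList(from_orig, from_new)
str(merged)
```

So the reordered names come from the new object, while foo is carried over from the original.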
But this operation can be slow. Not in this case, because it’s a one-row dataframe, but consider this case with 10,000,000 rows:
orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
bench::mark(
test = {
attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
}
)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 test 43.2ms 46.6ms 21.6 38.1MB 14.4
Of course, this doesn't take long in absolute terms, but it is much slower than in the first case with one row (which takes only a few microseconds). It seems weird to me that the time needed to add a single attribute to a dataframe increases with the size of the dataframe. Am I missing something? Is there a more efficient way to add a list of "simple" attributes to a large dataframe?
Edit: looking for a solution with base R only
When you create a data frame of n rows without explicitly declaring the row names, the row names are stored as an integer vector of length 2 of the form c(NA, -n).
If you copy the row names attribute from one data frame to another, R expands this compact vector into the full set of n row names in order to copy it, which takes longer the more rows there are. Avoid copying row names this way.
Alternatively, you could use data.table or the tidyverse: data.table can reorder columns by reference and a tibble keeps custom attributes when its columns are subset, so there is nothing to copy back.
Let's create a data frame with 10 rows.
num_rows <- 10
set.seed(0)
dat <- data.frame(
x_char = sample(letters, num_rows),
x_int = sample(1:10, num_rows)
)
Let's look at how it appears in memory. I use a helper function to create a simplified, tree representation of the output of lobstr::sxp(dat) to show how objects are represented in memory.
library(lobstr)
dat_sxp <- sxp(dat)
get_dat_obj_tree(dat_sxp)
1 dat VECSXP length: 2 mem_addr:0x7
2 ¦--x_char STRSXP length: 10 mem_addr:0x1
3 ¦--x_int INTSXP length: 10 mem_addr:0x2
4 °--_attrib LISTSXP length: 3 mem_addr:0x3
5 ¦--names STRSXP length: 2 mem_addr:0x4
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x6
The function replaces the memory addresses with unique integers (i.e. mem_addr:0x1 will remain the address of x_char every time the real address is looked up, unless the memory location of x_char actually changes).
We would expect the data to have length 10. But why are the row.names only length 2? Let's print them:
rownames(dat) # "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
attr(dat, "row.names") # 1 2 3 4 5 6 7 8 9 10
Clearly these are vectors with length 10. You might notice that one is a character vector and one is an integer vector. This led me down a lot of dead-ends, until I found this comment in the R source code:
## As from R 2.4.0, row.names can be either character or integer.
## row.names() will always return character.
## attr(, "row.names") will return either character or integer.
##
## Do not assume that the internal representation is either, since
## 1L:n is stored as the integer vector c(NA, n) to save space (and
## the C-level code to get/set the attribute makes the appropriate
## translations.
This reminded me of something you often see in reproducible examples:
dput(dat)
# structure(list(x_char = c("e", "i", "n", "z", "w", "b", "j",
# "l", "o", "a"), x_int = c(4L, 3L, 6L, 2L, 7L, 10L, 5L, 8L, 9L,
# 1L)), class = "data.frame", row.names = c(NA, -10L))
We see that row names are indeed represented as a vector of length 2, row.names = c(NA, -10L). This is the key to understanding how to avoid the expensive copy operation.
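The compact form can also be inspected directly with base R's .row_names_info(). The sketch below uses a small throwaway data frame (df is just an illustrative name) and compares the stored attribute with what attr() hands back:

```r
df <- data.frame(x = 1:5)

# type = 0L returns the row.names attribute as it is stored internally
compact <- .row_names_info(df, type = 0L)
print(compact)

# attr() goes through the C-level translation and returns one entry per row
expanded <- attr(df, "row.names")

print(length(compact))   # length of the stored form
print(length(expanded))  # length after expansion
```

The stored form has the same length whatever the number of rows, whereas the expanded form grows with the data.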
Adding an attribute does not in itself copy the row names. It simply creates a circumstance where you are more likely to copy row names, as attributes are not copied after every operation. R Internals states:
Subsetting (other than by an empty index) generally drops all attributes except names, dim and dimnames which are reset as appropriate.
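A small sketch of that rule on a plain named vector (x and foo are illustrative names only): indexing drops the custom attribute, while an empty index keeps everything.

```r
x <- structure(1:5, names = letters[1:5], foo = TRUE)

by_index <- x[2:3]  # ordinary subsetting
by_empty <- x[]     # empty index

# Only names survive ordinary subsetting; the empty index keeps foo as well
print(names(attributes(by_index)))
print(names(attributes(by_empty)))
```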
Let's create a new attribute, foo, and see what happens in memory:
attr(dat, "foo") <- TRUE
Let's look at the internal representation:
dat_foo_sxp <- sxp(dat)
get_dat_obj_tree(dat_foo_sxp)
1 dat VECSXP length: 2 mem_addr:0x7
2 ¦--x_char STRSXP length: 10 mem_addr:0x1
3 ¦--x_int INTSXP length: 10 mem_addr:0x2
4 °--_attrib LISTSXP length: 4 mem_addr:0x3
5 ¦--names STRSXP length: 2 mem_addr:0x4
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 ¦--row.names INTSXP length: 2 mem_addr:0x6
8 °--foo LGLSXP length: 1 mem_addr:0x8
Nothing has truly changed in memory - the attribute list simply has a new node, of type LGLSXP, i.e. a logical vector.
Let's re-order the columns.
new <- dat[, c(2,1)]
Although we have selected all the columns, we are essentially subsetting the data by index. Let's look at the nodes of the object in memory:
new_sxp <- sxp(new)
get_dat_obj_tree(new_sxp, "new")
1 new VECSXP length: 2 mem_addr:0x12
2 ¦--x_int INTSXP length: 10 mem_addr:0x2
3 ¦--x_char STRSXP length: 10 mem_addr:0x1
4 °--_attrib LISTSXP length: 3 mem_addr:0x9
5 ¦--names STRSXP length: 2 mem_addr:0x10
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x11
This is broadly what we would expect from a lazily-evaluated copy apart from the row.names, which have not changed but have a new memory address:
- integer column is the same.
- character column is the same.
- names have a new location (because it's re-ordered).
- class has the same address.
- row.names have a new memory address.
Perhaps R could have kept the row.names in the same memory location. After all, we are only subsetting columns, so the number and order of rows is unchanged.
However, and this is why my previous suggestion to pre-allocate the row names was wrong, the fact that there are new row.names does not significantly affect execution time. R is creating a new integer vector of length 2, regardless of the size of the data. This takes almost no time. It is probably not worth adding logic to the R source to establish whether the rows are the same, in order to avoid such a tiny operation.
It is notable in your example, and the answer by Joris C., that operations take longer if they include attr(new, "row.names") <- attr(dat, "row.names"), either individually or as part of a larger function call such as utils::modifyList(attributes(dat), attributes(new)). Let's try the simple way:
attr(new, "row.names") <- attr(dat, "row.names")
get_dat_obj_tree(sxp(new))
1 dat VECSXP length: 2 mem_addr:0x15
2 ¦--x_int INTSXP length: 10 mem_addr:0x2
3 ¦--x_char STRSXP length: 10 mem_addr:0x1
4 °--_attrib LISTSXP length: 3 mem_addr:0x13
5 ¦--names STRSXP length: 2 mem_addr:0x10
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x14
There's a new memory address. But the row.names attribute of new is still an integer vector of length 2. If we run dput(new) we will see row.names = c(NA, -10L).
So if we are copying an integer vector of length 2 from one place to another, regardless of the size of the data, why is it taking longer with larger data frames? The answer to this is what happens when you run:
attr(new, "row.names") <- attr(dat, "row.names")
This is syntactic sugar for:
new <- `attr<-`(new, "row.names", attr(dat, "row.names"))
Firstly, this means that we are evaluating the row.names for dat. Secondly, as R internals notes, with a similar example, a <- `dim<-`(a, c(7, 2)):
in principle two copies of a exist for the duration of the computation
So this may be happening twice.
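To make the equivalence concrete, here is a minimal sketch with a throwaway data frame and a cheap attribute (a, b and foo are illustrative names): the replacement syntax and the explicit function call give the same object, and in both cases the value argument is evaluated before it is stored.

```r
a <- data.frame(x = 1:3)
b <- a

# Replacement-function syntax
attr(a, "foo") <- TRUE

# The same operation written as an ordinary function call
b <- `attr<-`(b, "foo", TRUE)

print(identical(a, b))
```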
An easier way to understand this is by printing the right-hand side of that function call.
`attr<-`(new, "row.names", attr(dat, "row.names"))
# <truncated>
# attr(,"row.names")
# [1] 1 2 3 4 5 6 7 8 9 10
By the time the row.names are stored in new, the R source code in attrib.c is clever enough to restore them to the compact c(NA, n) form:
INTEGER(val)[0] = NA_INTEGER;
INTEGER(val)[1] = n; // +n: compacted *and* automatic row names
However, the damage is done: the short-form c(NA, -10) row names were fully expanded first, which, as you would expect (and have demonstrated), takes more time for longer vectors of row names.
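The round trip can be observed from R itself. This is only a sketch with a small data frame (n and value are illustrative names): the value handed to the assignment has one entry per row, yet what ends up stored is again two integers.

```r
n <- 1000L
dat <- data.frame(x = seq_len(n))
new <- dat[, "x", drop = FALSE]

# Reading the attribute yields the expanded row names, one per row
value <- attr(dat, "row.names")

# Writing them back: R checks the values and stores the compact form again
attr(new, "row.names") <- value

print(length(value))
print(length(.row_names_info(new, type = 0L)))
```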
It is possible to avoid this issue in base R, and also with data.table and tidyverse packages.
The main point is - do not copy the row names from one data frame to another. The function suggested by Joris C. to copy any attributes that were not copied by the subset operation, rather than copying all attributes, is a good base R solution.
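As a rough sketch of that idea (not Joris C.'s exact function; dropped is an illustrative name), you can work out which attributes the subset lost and restore only those, so that row.names is never assigned:

```r
orig <- data.frame(x1 = 1:3, x2 = 4:6)
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

# Attributes present on orig but missing from new after subsetting
dropped <- setdiff(names(attributes(orig)), names(attributes(new)))
print(dropped)

# Restore only those; names, class and row.names are left alone
for (nm in dropped) attr(new, nm) <- attr(orig, nm)
```

This assumes every attribute that subsetting drops is safe to copy verbatim, which holds for simple flags like foo but may not for attributes that depend on the column order.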
An alternative is to convert the data frame to a data.table and use data.table::setattr() to set attributes by reference:
library(data.table)
orig <- data.frame(x1 = 1, x2 = 2)
setDT(orig)
mem_location <- tracemem(orig)
setattr(orig, "foo", TRUE)
tracemem(orig) == mem_location # TRUE
attr(orig, "foo") # TRUE
Additionally, with data.table you can change the column order by reference so you do not lose the attributes when you reorder the columns:
setcolorder(orig, c(2,1))
attr(orig, "foo") # TRUE
orig
# x2 x1
# 1: 2 1
Similarly, a tibble() keeps custom attributes such as foo when you subset columns:
library(tibble)
set.seed(0)
num_rows <- 10
dat <- tibble(
x_char = sample(letters, num_rows),
x_int = sample(1:10, num_rows)
)
attr(dat, "foo") <- TRUE
new <- dat[,c(2,1)]
attr(new, "foo") # TRUE
I went down several dead-ends with this one, and posted two answers that were not quite right before I understood what was really happening under the hood. But I learned a lot about R in the process. Thanks for asking such an interesting question.
The reason the computation time of copying all data.frame attributes scales with the size of the data.frame seems to be mainly due to the row.names attribute.
We can check that copying the row.names attribute is responsible for most of the computation time:
orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
microbenchmark::microbenchmark(
all_attrs = { attributes(new) <- attributes(orig) },
rownames = { attr(new, "row.names") <- attr(orig, "row.names") },
foo = { attr(new, "foo") <- attr(orig, "foo") },
times = 10,
unit = "ms"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> all_attrs 60.477554 61.18414 64.3562408 61.9978505 67.117645 72.827139 10
#> rownames 59.831147 61.21029 69.6012781 64.2950890 68.880676 106.280348 10
#> foo 0.001043 0.00206 0.0072771 0.0087225 0.011206 0.015295 10
If we compare this to copying the foo attribute in the case of the small data.frame, the timing is (roughly) of the same order:
orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]
microbenchmark::microbenchmark(
foo = { attr(new, "foo") <- attr(orig, "foo") },
unit = "ms"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> foo 0.00115 0.00118 0.00146262 0.0012055 0.0012725 0.022368 100
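Since the question asks for base R only, the same comparison can be sketched without microbenchmark by using system.time(). The row count and repetition count below are arbitrary choices, and the timings will vary by machine and R version, so no expected output is shown:

```r
n <- 1e6
orig <- data.frame(x1 = rep(1, n), x2 = rep(2, n))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

reps <- 20  # arbitrary number of repetitions

# Time copying the row names versus copying the custom attribute
t_rownames <- system.time(
  for (i in seq_len(reps)) attr(new, "row.names") <- attr(orig, "row.names")
)
t_foo <- system.time(
  for (i in seq_len(reps)) attr(new, "foo") <- attr(orig, "foo")
)

# Elapsed seconds for each variant
print(c(rownames = t_rownames[["elapsed"]], foo = t_foo[["elapsed"]]))
```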
To be efficient, you can copy only the custom-defined attributes (instead of all data.frame attributes). For instance:
## replace only custom attributes
replace_attrs <- function(obj, new_attrs) {
for(nm in setdiff(names(new_attrs), names(attributes(data.frame())))) {
attr(obj, which = nm) <- new_attrs[[nm]]
}
return(obj)
}
new <- replace_attrs(new, attributes(orig))