(This is a follow-up question to this.)
Check this toy code:
> x <- data.frame(a = 1:2)
> foo <- function(z) { setDT(z) ; z[, b:=3:4] ; z }
> y <- foo(x)
>
> class(x)
[1] "data.table" "data.frame"
> x
   a
1: 1
2: 2
It looks like setDT did change x's class, but the addition of data did not apply to x.
What happened here?
In your function, z refers to the same object as x until setDT is called.
library(data.table)
foo <- function(z) {print(address(z)); setDT(z); print(address(z))}
x <- data.frame(a = 1:2)
address(x)
#[1] "0x555ec9a471e8"
foo(x)
#[1] "0x555ec9a471e8"
#[1] "0x555ec9ede300"
Inside setDT, execution reaches the following line while z still points to the same address as x:
setattr(z, "class", data.table:::.resetclass(z, "data.frame"))
setattr does not make a copy, so x and z still point to the same address and both now have class data.table:
x <- data.frame(a = 1:2)
z <- x
class(x)
#[1] "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"
setattr(z, "class", data.table:::.resetclass(z, "data.frame"))
class(x)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"
Then setalloccol is called, which in this case executes:
assign("z", .Call(data.table:::Calloccolwrapper, z, 1024, FALSE))
which now makes x and z point to different addresses:
address(x)
#[1] "0x555ecaa09c00"
address(z)
#[1] "0x555ec95de600"
And both have the class data.table:
class(x)
#[1] "data.table" "data.frame"
class(z)
#[1] "data.table" "data.frame"
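The rebinding step can be mimicked in base R without data.table. This is only a sketch of the mechanism: rebind is a hypothetical helper, not part of data.table, and it stands in for the assign call that setDT makes in its caller's frame. It shows why later changes to z inside foo no longer reach x.

```r
# rebind() is a hypothetical stand-in for the assign() call inside setDT:
# it replaces the caller's variable with a new, longer object.
rebind <- function(v) {
  name <- as.character(substitute(v))             # name of the caller's variable
  assign(name, c(v, 99), envir = parent.frame())  # rebind it in the caller's frame
  invisible(NULL)
}

foo <- function(z) {
  rebind(z)  # z in foo's frame now refers to a new object
  z
}

x <- c(1, 2)
y <- foo(x)
# y carries the extra element; x in the global environment is untouched,
# because only the name z inside foo was rebound.
```

The same thing happens to the data.table: once z is rebound to the over-allocated shallow copy, adding a column with := extends that copy, not the object that x still names.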
I think that if setDT had used
class(z) <- data.table:::.resetclass(z, "data.frame")
instead of
setattr(z, "class", data.table:::.resetclass(z, "data.frame"))
the problem would not occur.
x <- data.frame(a = 1:2)
z <- x
address(x)
#[1] "0x555ec9cd2228"
class(z) <- data.table:::.resetclass(z, "data.frame")
class(x)
#[1] "data.frame"
class(z)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec9cd2228"
address(z)
#[1] "0x555ec9cd65a8"
However, after class(z) <- value, z no longer points to the address it pointed to before (the column itself is not copied):
z <- data.frame(a = 1:2)
address(z)
#[1] "0x5653dbe72b68"
address(z$a)
#[1] "0x5653db82e140"
class(z) <- c("data.table", "data.frame")
address(z)
#[1] "0x5653dbe82d98"
address(z$a)
#[1] "0x5653db82e140"
But after setDT, z also no longer points to the address it pointed to before:
z <- data.frame(a = 1:2)
address(z)
#[1] "0x55b6f04d0db8"
setDT(z)
address(z)
#[1] "0x55b6efe1e0e0"
As @Matt-dowle pointed out, it is also possible to change the data in x through z:
x <- data.frame(a = c(1,3))
z <- x
setDT(z)
z[, b:=3:4]
z[2, a:=7]
z
#   a b
#1: 1 3
#2: 7 4
x
#   a
#1: 1
#2: 7
R.version.string
#[1] "R version 4.0.2 (2020-06-22)"
packageVersion("data.table")
#[1] ‘1.12.8’
A supplement to GKi's answer:
The position of the setalloccol call is indeed the direct culprit: it performs a shallow copy (i.e., it generates a new vector of pointers to the existing data columns) and additionally allocates extra slots (1024 by default) for future columns. If the class is set to data.table after this shallow copy (either by class(z)<- or by setattr), the change is applied to the new vector and not to the original argument.
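A small check of the shallow-copy claim, as a sketch rather than a description of the internals: it only uses the exported address and truelength helpers, and it uses a plain double column to keep the comparison simple. The variable names same_cols and over_alloc are made up for this illustration.

```r
library(data.table)

x <- data.frame(a = c(1, 3))  # plain double column
z <- x                        # second name for the same object
col_before <- address(x$a)    # where the column lives before setDT

setDT(z)

# The column vector itself is not copied: z still points at the same column.
same_cols <- identical(address(z$a), col_before)

# The new column-pointer vector has spare slots for columns added by :=.
over_alloc <- truelength(z) > length(z)
```

Both values are expected to be TRUE. Because the columns are shared, sub-assigning into an existing column through z can still reach x, while adding a new column only extends the vector of pointers that z holds.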
However, even with a fixed version of setDT (with setattr called after setalloccol), there seems to be no way to get consistent behaviour: some operations apply to the caller's copy and some don't.
df <- data.frame(a=1:2, b=3:4)
foo1 <- function(z) {
  setDT.fixed(z)        # the patched setDT described above (definition not shown)
  z[, b:=5]             # will apply to the caller copy
  data.table::setDF(z)
}
foo1(df)
# a b
# 1: 1 5
# 2: 2 5
class(df)
# [1] "data.frame"
df
# a b
# 1 1 5
# 2 2 5
foo2 <- function(z) {
  setDT.fixed(z)        # the patched setDT described above (definition not shown)
  z[, c:=5]             # will NOT apply to the caller copy
  data.table::setDF(z)
}
foo2(df)
# a b c
# 1: 1 3 5
# 2: 2 4 5
# Warning message:
# In `[.data.table`(z, , `:=`(c, 5)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
class(df)
# [1] "data.table" "data.frame"
df
# a b
# 1: 1 3
# 2: 2 4
(Using the i argument, e.g., z[!is.na(a), b:=6], adds an extra dimension of weirdness which I won't go into here.)
Bottom line, the data.table package took on the brave task of punching a hole in R's all-value semantics. It was pretty successful until setDT came along (BTW, in response to a SO question here). Using setDT within a function on an argument will probably never have consistent semantics and is almost guaranteed to get you nasty surprises.
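If the goal is simply to use data.table syntax inside a function without touching the caller's object, one way to sidestep the issue is to convert an explicit copy. This is a sketch, not a recommendation from the data.table authors: foo_copy is a made-up name, and the deep copy costs memory proportional to the size of the input.

```r
library(data.table)

foo_copy <- function(z) {
  z <- setDT(copy(z))  # deep copy first, then convert the copy by reference
  z[, b := 3:4]        # adds a column to the copy only
  z[2, a := 7L]        # sub-assignment also stays inside the copy
  z[]
}

x <- data.frame(a = 1:2)
y <- foo_copy(x)

class(x)  # still a plain data.frame
names(y)  # the result has both columns
```

Here x keeps its class and its data, because every by-reference operation acts on an object that only the function can see.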