In writing a package which relies on data.table, I've discovered some odd behavior. I have a function which removes and reorders some columns by reference, and it works just fine: the data.table I passed in is modified without assigning the function output. I have another function which adds new columns, however, and those changes do not always persist in the data.table that was passed in.
Here's a small example:
library(data.table) # I'm using 1.9.4
test <- data.table(id = letters[1:2], val=1:2)
foobar <- function(dt, col) {
dt[, (col) := 1]
invisible(dt)
}
test
# id val
#1: a 1
#2: b 2
saveRDS(test, "test.rds")
test2 <- readRDS("test.rds")
all.equal(test, test2)
#[1] TRUE
foobar(test, "new")
test
# id val new
#1: a 1 1
#2: b 2 1
foobar(test2, "new")
test2
# id val
#1: a 1
#2: b 2
What happened? What's different about test2? I can modify existing columns in place on either:
foobar(test, "val")
test
# id val new
#1: a 1 1
#2: b 1 1
foobar(test2, "val")
test2
# id val
#1: a 1
#2: b 1
But adding a column to test2 still doesn't work:
foobar(test2, "someothercol")
.Last.value
# id val someothercol
#1: a 1 1
#2: b 1 1
test2
# id val
#1: a 1
#2: b 1
I can't pin down all the cases where I see this behavior, but saving to and reading from RDS is the first case I can reliably replicate. Writing to and reading from a CSV doesn't seem to have the same problem.
Is this a pointer issue à la this issue, i.e. does serializing a data.table destroy the over-allocated pointers? Is there a simple way to restore them? How could I check for them inside my function, so that I could restore the pointers or raise an error if the operation isn't going to work?
I know I can assign the function output as a workaround, but that's not very data.table-y. Wouldn't that also create a temporary copy in memory?
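(On the copy worry: since := modifies the table in place and the function returns that same object invisibly, assigning the output just binds a second name to the same table rather than copying it. A small check using data.table's address(), my own sketch rather than anything from the post:)

```r
library(data.table)

dt <- data.table(id = letters[1:2], val = 1:2)
foobar <- function(dt, col) {
  dt[, (col) := 1]  # modify by reference
  invisible(dt)     # return the same object invisibly
}

out <- foobar(dt, "new")
# Same memory address: the assignment bound a second name, not a second copy
identical(address(out), address(dt))
```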
Arun has explained that it is indeed a pointer issue, which can be diagnosed with truelength and fixed with setDT or alloc.col. I ran into a problem encapsulating his solution in a function (continuing from the above code):
func <- function(dt) {if (!truelength(dt)) setDT(dt)}
func2 <- function(dt) {if (!truelength(dt)) alloc.col(dt)}
test2 <- readRDS("test.rds")
truelength(test2)
#[1] 0
truelength(func(test2))
#[1] 100
truelength(test2)
#[1] 0
truelength(func2(test2))
#[1] 100
truelength(test2)
#[1] 0
So it looks like the local copy inside the function is being properly modified, but the reference version is not. Why not?
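(My reading of what happens: when truelength is 0, alloc.col has to build a freshly over-allocated table and rebinds the *local* name dt inside the function to it, so the caller's binding still points at the old, zero-truelength object. Returning the table and assigning at the call site sidesteps this. A sketch, assuming alloc.col as in the question's data.table version; the function name func3 is mine:)

```r
library(data.table)

# Recreate the serialized table from the question
saveRDS(data.table(id = letters[1:2], val = 1:2), "test.rds")

func3 <- function(dt) {
  # Hypothetical wrapper: re-allocate if needed and RETURN the repaired
  # table, so the caller can rebind its own name to it
  if (!truelength(dt)) dt <- alloc.col(dt)
  dt
}

test2 <- readRDS("test.rds")
truelength(test2)      # 0 after deserialization
test2 <- func3(test2)  # assign the returned table at the call site
truelength(test2) > 0  # over-allocation restored for this binding
```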
Is this a pointer issue à la this issue, i.e. does serializing a data.table destroy the over-allocated pointers?
Yes loading from disk sets the external pointer to NULL. We will have to over-allocate again.
Is there a simple way to restore them?
Yes. You can test for truelength() of the data.table, and if it's 0, then use setDT() or alloc.col() on it.
truelength(test2) # [1] 0
if (!truelength(test2))
setDT(test2)
truelength(test2) # [1] 100
foobar(test2, "new")
test2[]
# id val new
# 1: a 1 1
# 2: b 2 1
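To address the question's last point (checking inside the function), a defensive variant of foobar could guard on truelength() and refuse to run on a table whose over-allocation is gone. The guard and the name foobar_safe are my own sketch, not part of the answer:

```r
library(data.table)

foobar_safe <- function(dt, col) {
  if (!truelength(dt)) {
    stop("dt has no over-allocated column slots (truelength is 0), ",
         "likely because it was deserialized; run setDT(dt) first")
  }
  dt[, (col) := 1]
  invisible(dt)
}
```

Erroring, rather than calling setDT() inside the function, keeps the repair at the call site, where setDT(test2) updates the caller's own binding.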
This should probably go in as a FAQ (can't remember seeing it there).
Already in FAQ in Warning Messages section.