Adding new columns to a data.table by-reference within a function not always working

Tags: r, data.table

While writing a package that relies on data.table, I've discovered some odd behavior. I have a function which removes and reorders some columns by reference, and it works just fine: the data.table I pass in is modified without my having to assign the function output. I have another function which adds new columns, however, and those changes do not always persist in the data.table that was passed in.

Here's a small example:

library(data.table)  # I'm using 1.9.4
test <- data.table(id = letters[1:2], val=1:2)
foobar <- function(dt, col) {
    dt[, (col) := 1]
    invisible(dt)
}

test
#  id val
#1: a   1
#2: b   2
saveRDS(test, "test.rds")
test2 <- readRDS("test.rds")
all.equal(test, test2)
#[1] TRUE
foobar(test, "new")
test
#  id val new
#1: a   1   1
#2: b   2   1
foobar(test2, "new")
test2
#  id val
#1: a   1
#2: b   2

What happened? What's different about test2? I can modify existing columns in-place on either:

foobar(test, "val")
test
#  id val new
#1: a   1   1
#2: b   1   1
foobar(test2, "val")
test2
#  id val
#1: a   1
#2: b   1

But adding to test2 still doesn't work:

foobar(test2, "someothercol")
.Last.value
#  id val someothercol
#1: a   1            1
#2: b   1            1
test2
#  id val
#1: a   1
#2: b   1

I can't pin down all the cases where I see this behavior, but saving to and reading from RDS is the first case I can reliably replicate. Writing to and reading from a CSV doesn't seem to have the same problem.

Is this a pointer issue ala this issue, like serializing a data.table destroys the over-allocated pointers? Is there a simple way to restore them? How could I check for them inside my function, so I could restore the pointers or error if the operation isn't going to work?

I know I can assign the function output as a workaround, but that's not very data.table-y. Wouldn't that also create a temporary copy in memory?
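For reference, the reassignment workaround mentioned above can be sketched as follows. This is only a minimal illustration that reuses foobar from the example; the temporary file path is an assumption made to keep the snippet self-contained, and it makes no claim about how much memory is copied.

```r
library(data.table)

# Same function as in the example above.
foobar <- function(dt, col) {
    dt[, (col) := 1]
    invisible(dt)
}

# Round-trip through RDS to reproduce the problematic table.
# (tempfile() is used here only so the sketch does not touch the working directory.)
path <- tempfile(fileext = ".rds")
saveRDS(data.table(id = letters[1:2], val = 1:2), path)
test2 <- readRDS(path)

# Workaround: rebind the name to the function's return value,
# which does carry the new column (see .Last.value above).
test2 <- foobar(test2, "someothercol")
names(test2)
```

This works because the returned object has the new column even when the original binding does not, but it gives up the modify-in-place style.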

Response to Arun's solution

Arun has explained that it is indeed a pointer issue, which can be diagnosed with truelength and fixed with setDT or alloc.col. I ran into a problem when encapsulating his solution in a function (continuing from the code above):

func <- function(dt) {if (!truelength(dt)) setDT(dt)}
func2 <- function(dt) {if (!truelength(dt)) alloc.col(dt)}
test2 <- readRDS("test.rds")
truelength(test2)
#[1] 0
truelength(func(test2))
#[1] 100
truelength(test2)
#[1] 0
truelength(func2(test2))
#[1] 100
truelength(test2)
#[1] 0

So it looks like the local copy inside the function is being properly modified, but the reference version is not. Why not?

ClaytonJY asked Jan 21 '15 23:01



1 Answer

Is this a pointer issue ala this issue, like serializing a data.table destroys the over-allocated pointers?

Yes, loading from disk sets the external pointer to NULL, so we have to over-allocate again.

Is there a simple way to restore them?

Yes. You can test for truelength() of the data.table, and if it's 0, then use setDT() or alloc.col() on it.

truelength(test2) # [1] 0
if (!truelength(test2))
    setDT(test2)
truelength(test2) # [1] 100

foobar(test2, "new")
test2[]
#    id val new
# 1:  a   1   1
# 2:  b   2   1
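The question also asks how to error out when the operation is not going to work. A minimal sketch of that, assuming the truelength() check above is a sufficient test: foobar_checked is a hypothetical variant of foobar (not part of data.table) that stops when the table has no over-allocated slots, so the caller knows to run setDT() in their own scope first.

```r
library(data.table)

# Hypothetical guard around the := call: refuse to run when the table has
# no over-allocated column slots (truelength of 0, as seen after readRDS()).
foobar_checked <- function(dt, col) {
    stopifnot(is.data.table(dt))
    if (truelength(dt) == 0L) {
        stop("data.table is not over-allocated; call setDT() on it before passing it in")
    }
    dt[, (col) := 1]
    invisible(dt)
}

# Reproduce the table loaded from disk (tempfile() is just for self-containment).
path <- tempfile(fileext = ".rds")
saveRDS(data.table(id = letters[1:2], val = 1:2), path)
test2 <- readRDS(path)

# Restore the over-allocation in the calling scope, then add by reference.
if (!truelength(test2)) setDT(test2)
foobar_checked(test2, "new")
names(test2)
```

The check lives inside the function, but the repair is done by the caller, which matches the way setDT() is used at the top level in the snippet above.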

This should probably go in the FAQ (I can't remember seeing it there).
It is already in the FAQ, in the Warning Messages section.

Arun answered Oct 14 '22 00:10