 

Understanding the reference properties of data.table in R

Tags: r, data.table

To clear things up for myself, I would like to better understand when data.table makes a copy and when it does not. As the question "Understanding exactly when a data.table is a reference to (vs a copy of) another data.table" points out, if you simply run the following, you end up modifying the original:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

newDT <- DT        # reference, not copy
newDT[1, a := 100] # modify new DT

print(DT)          # DT is modified too.
#        a  b
# [1,] 100 11
# [2,]   2 12

However, if you do the following (for example), only the new table is modified and the original is left untouched:

DT = data.table(a=1:10)
DT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

newDT = DT[a<11]
newDT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

newDT[1:5,a:=0L]

newDT
     a
 1:  0
 2:  0
 3:  0
 4:  0
 5:  0
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

DT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

As I understand it, this happens because when you evaluate an i expression, data.table returns a whole new table rather than a reference to the memory occupied by the selected elements of the old data.table. Is this correct?

EDIT: Sorry, I meant i, not j (corrected above).

Alex asked Apr 08 '13 22:04



1 Answer

When you create newDT in the second example, you are evaluating i (not j). := assigns by reference within the j argument. There is no equivalent for i, because a data.table over-allocates its column slots but not its rows.
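As a small sketch of the by-reference behaviour of := (the names aliasDT and copyDT are only illustrative), plain assignment gives a second name for the same table, while data.table's copy() gives an independent one:

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

aliasDT <- DT        # plain assignment: a second name for the same table
copyDT  <- copy(DT)  # explicit copy: an independent table

# Both names point at the same object; the explicit copy does not.
same_as_alias <- address(DT) == address(aliasDT)
same_as_copy  <- address(DT) == address(copyDT)

aliasDT[1, a := 100]  # := in j modifies the shared table in place

DT$a      # sees the change made through aliasDT
copyDT$a  # still holds the original values
```

If you want to use := on a new table without touching the original, calling copy() first is the usual way to get there.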

A data.table is a list. Its length equals the number of columns, but it is over-allocated so that you can add more columns without copying the entire table (e.g. using := in j).

If we inspect the data.table, we can see the truelength (tl = 100), that is, the number of column pointer slots:

 .Internal(inspect(DT))
@1427d6c8 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=1, tl=100)
  @b249a30 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...

Within the data.table each element has length 10, and tl=0. Currently there is no method to increase the truelength of the columns to allow appending extra rows by reference.
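A minimal sketch of what the over-allocation allows, assuming a reasonably recent data.table (the exact truelength depends on the version and, as far as I know, on the datatable.alloccol option, so it need not be 100 as in the output above):

```r
library(data.table)

DT <- data.table(a = 1:10)

length(DT)      # columns currently in use
truelength(DT)  # column pointer slots allocated (more than are in use)

addr_before <- address(DT)  # address of the list of column pointers
DT[, b := a * 2L]           # add a column by reference in j
addr_after <- address(DT)

# A spare slot was used, so the list itself should not have been reallocated.
identical(addr_before, addr_after)
```

Only the list of column pointers has spare capacity here; the column vectors themselves do not, which is why rows cannot be appended the same way.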

From ?truelength

Currently, it's just the list vector of column pointers that is over-allocated (i.e. truelength(DT)), not the column vectors themselves, which would in future allow fast row insert()

When you evaluate i, data.table doesn't check whether you have simply returned all rows in their original order (the only case in which a copy could be avoided); it always returns a copy.
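This can be checked with a short sketch based on the question's second example; address() is data.table's helper for looking at where an object lives in memory:

```r
library(data.table)

DT <- data.table(a = 1:10)

newDT <- DT[a < 11]  # i keeps every row, in the same order

same_values <- identical(DT$a, newDT$a)          # contents match
same_object <- address(DT) == address(newDT)     # but it is a different table

newDT[1:5, a := 0L]  # modifies newDT by reference

DT$a     # the original column is untouched
newDT$a  # only the subset result has changed
```

So even an i expression that selects everything gives you a table you can safely modify with := without affecting the original.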

mnel answered Nov 04 '22 20:11
