Just to clear some stuff up for myself, I would like to better understand when copies are made and when they are not in data.table. As this question points out ("Understanding exactly when a data.table is a reference to (vs a copy of) another data.table"), if one simply runs the following then you end up modifying the original:
library(data.table)
DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
# a b
# [1,] 1 11
# [2,] 2 12
newDT <- DT # reference, not copy
newDT[1, a := 100] # modify new DT
print(DT) # DT is modified too.
# a b
# [1,] 100 11
# [2,] 2 12
However, if one does this (for example), then only the new version is modified and the original is left untouched:
DT = data.table(a=1:10)
DT
a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
newDT = DT[a<11]
newDT
a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
newDT[1:5,a:=0L]
newDT
a
1: 0
2: 0
3: 0
4: 0
5: 0
6: 6
7: 7
8: 8
9: 9
10: 10
DT
a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
As I understand it, this happens because when you execute an i statement, data.table returns a whole new table rather than a reference to the memory occupied by the selected elements of the old data.table. Is this correct?
EDIT: sorry, I meant i, not j (changed this above).
When you create newDT in the second example, you are evaluating i (not j). := assigns by reference within the j argument. There is no equivalent for i, because the over-allocation covers the column slots, not the rows.
A data.table is a list. Its length equals the number of columns, but it is over-allocated so you can add more columns without copying the entire table (e.g. using := in j).
If we inspect the data.table, we can see the truelength (tl = 100), that is, the number of column pointer slots:
.Internal(inspect(DT))
@1427d6c8 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=1, tl=100)
@b249a30 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...
Within the data.table each element has length 10 and tl=0. Currently there is no method to increase the truelength of the columns to allow appending extra rows by reference.
From ?truelength:
Currently, it's just the list vector of column pointers that is over-allocated (i.e. truelength(DT)), not the column vectors themselves, which would in future allow fast row insert()
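The over-allocation described above can be observed from R. This is a minimal sketch, assuming a data.table version that exports truelength() and address(). The number of spare slots depends on the version and on the datatable.alloccol option, so no particular value is assumed, and the variable names are only illustrative.

```r
library(data.table)

DT <- data.table(a = 1:10)

n_cols  <- length(DT)       # columns actually in use
n_slots <- truelength(DT)   # allocated column-pointer slots (version dependent)

# Adding a column with := fills one of the spare slots, so the table
# object itself should not be reallocated while spare slots remain.
addr_before <- address(DT)
DT[, b := a * 2L]
same_object <- address(DT) == addr_before

print(c(columns = n_cols, slots = n_slots))
print(same_object)
```

If same_object is TRUE, the new column was added in place, which is the behaviour the over-allocated list of column pointers is meant to allow.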
When you evaluate i, data.table doesn't check whether you have simply returned all rows in the same order as in the original (so that it could skip the copy in that one case); it simply returns a copy.
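The three cases can be compared side by side. The sketch below assumes address() and copy() from data.table; the names ref_dt, sub_dt and copy_dt are made up for illustration. It contrasts plain assignment, a subset through i that happens to keep every row, and an explicit copy().

```r
library(data.table)

DT <- data.table(a = 1:10)

ref_dt  <- DT            # another name for the same object
sub_dt  <- DT[a < 11]    # i is evaluated: a new table, even though all rows match
copy_dt <- copy(DT)      # explicit copy

# Which of the three still point at the original table object?
print(c(ref  = address(ref_dt)  == address(DT),
        sub  = address(sub_dt)  == address(DT),
        copy = address(copy_dt) == address(DT)))

sub_dt[1:5, a := 0L]     # changes sub_dt only
copy_dt[1:5, a := 0L]    # changes copy_dt only
ref_dt[1, a := 100L]     # changes DT as well

print(DT$a)
```

Only the update through ref_dt shows up in DT. When an independent table with all rows is wanted, copy(DT) states that intent more directly than relying on a subset in i.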