
Why does data.table get copied when adding a new column?

Tags:

r

data.table

When a new column is added to a data.table that has been loaded from disk, the data.table gets copied.

library('data.table')
dt <- data.table(a=1,b=2)
save.image("test.RData")
load("test.RData")
dt
#   a b
#1: 1 2

class(dt)
#[1] "data.table" "data.frame"

address(dt)
#[1] "00000000046F1F38"

dt[, b := NULL]
address(dt)
#[1] "00000000046F1F38"
dt[, c := 2]
address(dt)
#[1] "000000000D815618"

Is this a bug, or am I doing something wrong? I am using version 1.9.6 of the data.table package.

imsc asked Feb 12 '16 12:02


1 Answer

data.table avoids copies when adding columns by over-allocating pointer slots for the list of column vectors when the data.table is created. When you load a data.table from disk like this, that over-allocation has not happened yet. It is done the first time you add a column, and this makes a copy necessary.

library('data.table')
dt <- data.table(a=1,b=2)
save.image("test.RData")
load("test.RData")

truelength(dt)
#[1] 0

dt[, b := NULL]
truelength(dt)
#[1] 0

dt[, c := 2]
truelength(dt)
#[1] 101

To quote help("truelength"):

For tables loaded from disk however, truelength is 0 in R 2.14.0 and random in R <= 2.13.2; i.e., in both cases perhaps unexpected. data.table detects this state and over-allocates the loaded data.table when the next column addition or deletion occurs. All other operations on data.table (such as fast grouping and joins) do not need truelength.

It seems that the documentation is slightly out of date since the copy doesn't happen during deletion of a column.
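If the copy on the first column addition is a concern, one option is to restore the over-allocation yourself right after loading. The following is only a sketch, assuming `alloc.col()` (documented on the same help page as `truelength`; newer releases also call it `setalloccol()`) is available in your version. The temporary file path is just for illustration.

```r
library(data.table)

dt <- data.table(a = 1, b = 2)
rdata_file <- tempfile(fileext = ".RData")  # illustrative temp path, not from the question
save(dt, file = rdata_file)
rm(dt)
load(rdata_file)

tl_loaded <- truelength(dt)  # reported as 0 in the session above

alloc.col(dt)                # re-over-allocate the column pointer slots
tl_alloc <- truelength(dt)   # should now exceed the number of columns

addr_before <- address(dt)
dt[, c := 2]                 # uses a spare slot, so no copy should be needed
addr_after <- address(dt)

identical(addr_before, addr_after)
```

With the spare slots back in place, the address should be the same before and after `:=`.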

Note that a copy also happens if you add more columns than have been over-allocated during "normal" creation of a data.table.
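A small sketch of that last case: it reads the number of allocated slots with `truelength()` and then adds more columns than that in a single `:=` call. The column names `v1`, `v2`, ... are made up for the example.

```r
library(data.table)

dt <- data.table(a = 1)
slots <- truelength(dt)      # slots allocated at creation (depends on version and options)

addr_before <- address(dt)

# Add more new columns than there are allocated slots
new_cols <- paste0("v", seq_len(slots + 1L))
dt[, (new_cols) := 0]

addr_after <- address(dt)

identical(addr_before, addr_after)  # expected FALSE: the pointer list had to be reallocated
truelength(dt)                      # over-allocated again after the growth
```

If you know in advance that many columns will be added, calling `alloc.col(dt, n)` with a larger `n` beforehand should avoid this reallocation.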

Roland answered Oct 08 '22 02:10