
Make a shallow copy in data.table

I read in an SO topic an answer from Matt Dowle about a shallow function to make shallow copies in data.table. However, I can't find the topic again.

data.table does not export any function called shallow. There is an internal one, but it is not documented. Can I use it safely? What is its behavior?

What I would like to do is a memory efficient copy of a big table. Let DT be a big table with n columns and f a function which memory efficiently adds a column. Is something like that possible?

DT2 = f(DT)

with DT2 being a data.table with n columns pointing to the original addresses (no deep copies) and an extra one existing only for DT2. If yes, what happens to DT if I do DT2[, col3 := NULL]?

JRR asked Aug 28 '17 18:08

1 Answer

No, you can't use data.table:::shallow safely. It is deliberately not exported and is not meant for users: there is no guarantee that it keeps working as it does now, or that its name or arguments won't change in future.

Having said this, you could decide to use it as long as either i) you can guarantee that := or set* won't be called on the result, by you or by your users (if you're creating a package), or ii) you're ok with both objects being changed by reference if := or set* is called on the result. When shallow is used internally by data.table, that's what we promise ourselves.
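A minimal sketch of what case ii) can look like in practice. It calls the internal data.table:::shallow purely for illustration, so it assumes the function still exists with this behavior in the installed version. The table and column names are made up for the example.

```r
library(data.table)

DT <- data.table(a = 1:3, b = c(10, 20, 30))

# Internal and unexported: its name, arguments, or behavior may change.
DT2 <- data.table:::shallow(DT)

# The two tables are expected to share the same column vectors (no deep copy).
shared <- identical(address(DT$a), address(DT2$a))

# Sub-assigning by reference on the shallow copy writes into the shared
# vector, so the original is expected to change as well (case ii).
DT2[2L, b := 99]
b_in_original <- DT$b[2L]

# copy() is the exported, documented alternative: a deep copy that can be
# modified by reference without touching DT.
DT3 <- copy(DT)
DT3[1L, b := -1]
```

The price of copy() is memory: every column is duplicated, which is exactly what the question is trying to avoid for a big table.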

More background is in this answer from a few days ago: https://stackoverflow.com/a/45891502/403310

In that question I asked for the bigger picture: why is this needed? Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count.

In your question you alluded to your bigger picture and that is very useful. So you'd like to create a function which adds working columns to a big data.table inside the function but doesn't change the big data.table. Can you explain more why you'd like to create a function like that? Why not load the big data.table, add the ephemeral working columns directly to it, and then proceed? Your R session is already a working copy in memory of the data, which is persistent somewhere else.
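One way the "add the working columns directly" suggestion could be sketched, using only exported data.table features. The function name summarise_with_tmp and the columns g, x and tmp_sq are hypothetical. The working column is added by reference and removed again when the function exits, so the caller's table ends up with its original columns and no copy of the big table is made.

```r
library(data.table)

# Hypothetical helper: adds a temporary working column by reference,
# uses it, and removes it again on exit.
summarise_with_tmp <- function(DT) {
  DT[, tmp_sq := x^2]                        # modifies the caller's DT in place
  on.exit(DT[, tmp_sq := NULL], add = TRUE)  # clean up, even if an error occurs
  DT[, .(total = sum(tmp_sq)), by = g]       # the result is a new, small table
}

DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3))
res <- summarise_with_tmp(DT)

names(DT)  # back to "g" and "x" once the function has returned
```

While the function is running, the temporary column is visible on the original table, so this only fits when nothing else reads DT in the meantime.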

Note that I am not saying no. I'm not saying that you don't have a valid reason. I'm asking to discover more about that valid reason so the priority can be raised.

If that isn't the answer you had seen, there are currently 39 questions or answers returned by the search string "[data.table] shallow". Worst case, you could trawl through those to find it again.

Matt Dowle answered Oct 09 '22 09:10