I read in an SO topic an answer from Matt Dowle about a shallow function for making shallow copies in data.table. However, I can't find the topic again. data.table does not export any function called shallow; there is an internal one, but it is undocumented. Can I use it safely? What is its behavior?
What I would like to do is make a memory-efficient copy of a big table. Let DT be a big table with n columns and f a function that adds a column in a memory-efficient way. Is something like this possible?

DT2 = f(DT)

with DT2 being a data.table whose n columns point to the original addresses (no deep copies), plus an extra column that exists only for DT2. If yes, what happens to DT if I do DT2[, col3 := NULL]?
A shallow copy of an object is a copy whose properties share the same references (point to the same underlying values) as those of the source object from which the copy was made.
A shallow copy is the least-effort approach, since it does less work. It is especially well suited to immutable objects, where sharing is safe: an immutable object has no mutable internal state and cannot be changed; only variables can be rebound to other values.
A shallow copy stores references to the original memory addresses, while a deep copy stores copies of the objects' values. Changes made through a shallow copy are therefore reflected in the original object; changes made through a deep copy are not.
A shallow copy means constructing a new collection object and then populating it with references to the child objects found in the original. In essence, a shallow copy is only one level deep. The copying process does not recurse and therefore won't create copies of the child objects themselves.
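To make the distinction concrete in data.table terms, here is a minimal sketch (the table and column names are illustrative): plain assignment gives a shallow copy whose columns are shared, while copy() gives a deep copy:

```r
library(data.table)

DT <- data.table(a = 1:3, b = letters[1:3])

DT_shallow <- DT        # plain assignment: both names share the same columns
DT_deep    <- copy(DT)  # copy(): the column vectors themselves are duplicated

# Updating by reference through the shallow copy is visible in DT ...
DT_shallow[, a := a * 10L]
identical(DT$a, c(10L, 20L, 30L))  # TRUE

# ... while the deep copy still holds the original values.
identical(DT_deep$a, 1:3)          # TRUE
```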
You can't use data.table:::shallow safely, no. It is deliberately unexported and not meant for user use: there is no guarantee that it will keep working, nor that its name or arguments won't change in future.
Having said this, you could decide to use it as long as you can either i) guarantee that := or set* won't be called on the result, either by you or your users (if you're creating a package), or ii) accept that if := or set* is called on the result, both objects will be changed by reference. When shallow is used internally by data.table, that's what we promise ourselves.
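As a sketch of point ii), this is what the shared-columns risk can look like in practice. Note that data.table:::shallow is internal and unexported, so this is an illustration under that caveat, not a supported API:

```r
library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- data.table:::shallow(DT)  # internal API: no stability guarantee

# The column vectors are shared, so sub-assigning by reference through
# DT2 writes into the same vector that DT sees.
DT2[x == 2L, x := 0L]
DT$x  # DT is changed too, even though := was called on DT2
```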
More background in this answer from a few days ago: https://stackoverflow.com/a/45891502/403310
In that question I asked for the bigger picture: why is this needed? Having that clear would help to raise the priority of either investigating ALTREP or perhaps doing our own reference counting.
In your question you alluded to your bigger picture and that is very useful. So you'd like to create a function which adds working columns to a big data.table inside the function but doesn't change the big data.table. Can you explain more about why you'd like to create a function like that? Why not load the big data.table, add the ephemeral working columns directly to it, and then proceed? Your R session is already a working copy in memory of the data, which is persistent somewhere else.
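The suggested workflow, adding ephemeral working columns directly and removing them by reference when done, might look like this (the column name tmp is illustrative):

```r
library(data.table)

DT <- data.table(id = 1:5, val = c(2, 4, 6, 8, 10))

# Add a working column by reference: no copy of the existing columns is made.
DT[, tmp := val * 2]

# ... use DT$tmp for the intermediate computation ...

# Remove it by reference when done; again, no copy is made.
DT[, tmp := NULL]
names(DT)  # back to "id", "val"
```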
Note that I am not saying no. I'm not saying that you don't have a valid reason. I'm asking to discover more about that valid reason so the priority can be raised.
If that isn't the answer you had seen, there are currently 39 questions or answers returned by the search string "[data.table] shallow". Worst case, you could trawl through those to find it again.