I would like to ask whether the following behavior of data.table
is a feature or a bug.
Given the data.table
dt = data.table(
group = c(rep('group1',5),rep('group2',5)),
x = as.numeric(c(1:5, 1:5)),
y = as.numeric(c(5:1, 5:1)),
z = as.numeric(c(1,2,3,2,1, 1,2,3,2,1))
)
and a vector of column names containing a duplicate,
cols = c('y','x','y','z') # contains a duplicate column name
data.table rightly prevents me from assigning values to the duplicate column names:
dt[,(cols) := lapply(.SD,identity), .SDcols=cols] # Error (OK)
This seems like appropriate behavior to me, because it can help avoid unintended consequences. However, if I do the same assignment by groups,
dt[,(cols) := lapply(.SD,identity), .SDcols=cols, by=group] # No error!
then data.table doesn't throw an error. The assignment goes through, and one can verify that columns y and z have been interchanged.
This occurred for me in a large application while demeaning variables by group, and it was difficult to trace the source of this behavior. The recommendation for the user, of course, is to avoid duplicate column names when assigning, and to avoid providing duplicate names to .SDcols. However, would it not be better for data.table to throw an error in this situation?
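As a workaround until the package itself raises an error, the column vector can be checked before the grouped assignment. The following is a minimal sketch, assuming the dt and cols defined above; assert_unique_cols() is a hypothetical helper written for this illustration, not a data.table function.

```r
library(data.table)

dt <- data.table(
  group = rep(c("group1", "group2"), each = 5),
  x = as.numeric(c(1:5, 1:5)),
  y = as.numeric(c(5:1, 5:1)),
  z = as.numeric(c(1, 2, 3, 2, 1, 1, 2, 3, 2, 1))
)

cols <- c("y", "x", "y", "z")  # contains a duplicate column name

# Hypothetical helper (not part of data.table): stop if any name repeats.
assert_unique_cols <- function(cols) {
  dups <- unique(cols[duplicated(cols)])
  if (length(dups) > 0L) {
    stop("duplicate column names: ", paste(dups, collapse = ", "))
  }
  invisible(cols)
}

# Calling assert_unique_cols(cols) here would stop with an error.
# Alternatively, de-duplicate explicitly, then assert before assigning.
cols <- unique(cols)
assert_unique_cols(cols)

# Demean each selected column within its group.
dt[, (cols) := lapply(.SD, function(v) v - mean(v)), .SDcols = cols, by = group]
```

Whether to stop on duplicates or silently call unique() is a design choice; stopping is safer when the duplicate points to a mistake upstream.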
This is a bug, which was fixed in version 1.12.4 of data.table. Here is the bug report: https://github.com/Rdatatable/data.table/issues/4874.
Other users with this issue can simply update their package version, for example using install.packages('data.table'). To check the current package version, load data.table and then look at the output of sessionInfo().
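As an alternative to reading through sessionInfo(), the installed version can be queried directly with base R's packageVersion(). This is only a sketch; the threshold "1.12.4" is the fix version quoted above, taken as given.

```r
# Installed version of data.table, without attaching the package.
installed <- packageVersion("data.table")
print(installed)

# Compare against the fix version quoted in this answer.
if (installed < "1.12.4") {
  message("Consider updating with install.packages('data.table')")
}
```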
But it would be wise to avoid supplying duplicate column names to .SDcols.