
data.table switches column names

Tags: r, data.table

I would like to ask whether the following behavior of data.table is a feature or a bug.

Given the data.table

library(data.table)

dt = data.table(
  group = c(rep('group1',5),rep('group2',5)),
  x = as.numeric(c(1:5, 1:5)),
  y = as.numeric(c(5:1, 5:1)),
  z = as.numeric(c(1,2,3,2,1, 1,2,3,2,1))
)

and a vector of column names containing a duplicate,

cols = c('y','x','y','z') # contains a duplicate column name

data.table rightly prevents me from assigning values to the duplicate column names:

dt[,(cols) := lapply(.SD,identity), .SDcols=cols] # Error (OK)

This seems like appropriate behavior to me, because it can help avoid unintended consequences. However, if I do the same assignment by groups,

dt[,(cols) := lapply(.SD,identity), .SDcols=cols, by=group] # No error!

then data.table doesn't throw an error. The assignment goes through, and one can verify that columns y and z have been interchanged.

This occurred for me in a large application while demeaning variables by group, and it was difficult to trace the source of this behavior. The recommendation for the user, of course, is to avoid duplicate column names when assigning, and to avoid providing duplicate names to .SDcols. However, would it not be better for data.table to throw an error in this situation?
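As a minimal sketch of the recommendation above, one could check a vector of column names for duplicates before passing it to `:=` or `.SDcols`. The helper `check_unique_cols` is hypothetical (it is not part of data.table), and the demeaning step is only an illustration of the kind of grouped assignment described in the question.

```r
# Hypothetical helper (not part of data.table): stop early if a vector of
# column names contains duplicates.
check_unique_cols <- function(cols) {
  dupes <- unique(cols[duplicated(cols)])
  if (length(dupes) > 0) {
    stop("duplicate column names: ", paste(dupes, collapse = ", "))
  }
  invisible(cols)
}

cols <- c('y', 'x', 'y', 'z')
# check_unique_cols(cols) would stop here, because 'y' appears twice.

# Drop the duplicate, keeping the first occurrence of each name.
cols <- unique(cols)
check_unique_cols(cols)

# The grouped assignment itself needs data.table; skip it if not installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(
    group = c(rep('group1', 5), rep('group2', 5)),
    x = as.numeric(c(1:5, 1:5)),
    y = as.numeric(c(5:1, 5:1)),
    z = as.numeric(c(1, 2, 3, 2, 1, 1, 2, 3, 2, 1))
  )
  # Demean each selected column within its group.
  dt[, (cols) := lapply(.SD, function(v) v - mean(v)), .SDcols = cols, by = group]
}
```

Running the check before every grouped assignment makes a duplicate fail loudly at the call site, regardless of which data.table version is installed.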

JS1204 asked Jan 13 '21 22:01



1 Answer

This is a bug, which was fixed in version 1.12.4 of data.table. Here is the bug report: https://github.com/Rdatatable/data.table/issues/4874.

Other users with this issue can simply update their package version, for example using install.packages('data.table'). To check the current package version, load data.table and then look at the output of sessionInfo().
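A more direct way to check the version is `packageVersion()` from base R's utils package. The sketch below compares the installed version against 1.12.4, the version cited in this answer; the variable names are illustrative only.

```r
# Version that this answer cites as containing the fix.
fix_version <- package_version("1.12.4")

if (requireNamespace("data.table", quietly = TRUE)) {
  installed <- packageVersion("data.table")
  print(installed)
  if (installed < fix_version) {
    message("data.table is older than ", fix_version,
            "; consider install.packages('data.table')")
  }
} else {
  message("data.table is not installed")
}
```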

But it would be wise to avoid supplying duplicate column names to .SDcols.

JS1204 answered Oct 21 '22 03:10