Can data.table handle identical column names when using .SDcols?

Question

When using .SD to apply a function to a subset of dt's columns I can't seem to find the correct way to handle the situation where I have duplicated column names... e.g.

#  Make some data
set.seed(123)
dt <- data.table( matrix( sample(6,16,repl=T) , 4 ) )
setnames(dt , rep( letters[1:2] , 2 ) )
#   a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6

#  Use .SDcols to multiply both column 'a' specifying them by numeric position
dt[ , lapply( .SD , `*`  , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
#    a  a
#1:  4  4
#2: 10 10
#3:  6  6
#4: 12 12

I couldn't get it to work with when .SDcols was a character vector of column names so I tried numeric positions (which( names(dt) %in% "a" ) gives a vector [1] 1 3 ) but it also seems to just multiply the first a column only. Am I doing something wrong?

.SDcols Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.

These also returned the same result as above...

dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]

packageVersion("data.table")
#[1] ‘1.8.11’

CHP · Accepted Answer

How about this

dt[, "a"] * 2
##    a a.1
## 1  4   8
## 2 10   6
## 3  6  12
## 4 12   6

For more detailed discussion

https://chat.stackoverflow.com/transcript/message/12783493#12783493

Arun · Answer

This now works as intended since 1.9.4. From NEWS:

Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, 'j', or in .SDcols, then just those columns are either returned (or deleted if you provide -.SDcols or !j). If instead, column names are given and there are more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008. Note that using by= to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.

Basically, if you do:

dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
#     a  a
# 1:  4  4
# 2: 10 10
# 3:  6  6
# 4: 12 12

It'll still give the unintended result, as it's hard to tell which "a" you're mentioning each time - so choosing the first always.

But if you clearly specify (as you do in your Q):

dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
#     a  a
# 1:  4  8
# 2: 10  6
# 3:  6 12
# 4: 12  6

Can data.table handle identical column names when using .SDcols?

Tags:

syntax

r

data.table

Simon O'Hanlon

2 Answers

CHP

This now works as intended since 1.9.4. From NEWS:

Arun

Recent Activity

Donate For Us

Can data.table handle identical column names when using .SDcols?

Tags:

syntax

r

data.table

Simon O'Hanlon

2 Answers

CHP

This now works as intended since 1.9.4. From NEWS:

Arun

Related questions

Recent Activity

Donate For Us