Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can data.table handle identical column names when using .SDcols?

When using .SD to apply a function to a subset of dt's columns I can't seem to find the correct way to handle the situation where I have duplicated column names... e.g.

#  Make some data
set.seed(123)
dt <- data.table( matrix( sample(6,16,repl=T) , 4 ) )
setnames(dt , rep( letters[1:2] , 2 ) )
#   a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6

#  Use .SDcols to multiply both column 'a' specifying them by numeric position
dt[ , lapply( .SD , `*`  , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
#    a  a
#1:  4  4
#2: 10 10
#3:  6  6
#4: 12 12

I couldn't get it to work with when .SDcols was a character vector of column names so I tried numeric positions (which( names(dt) %in% "a" ) gives a vector [1] 1 3 ) but it also seems to just multiply the first a column only. Am I doing something wrong?

.SDcols Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.

These also returned the same result as above...

dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]

packageVersion("data.table")
#[1] ‘1.8.11’
like image 711
Simon O'Hanlon Avatar asked Dec 06 '25 08:12

Simon O'Hanlon


2 Answers

How about this

dt[, "a"] * 2
##    a a.1
## 1  4   8
## 2 10   6
## 3  6  12
## 4 12   6

For more detailed discussion

https://chat.stackoverflow.com/transcript/message/12783493#12783493

like image 159
CHP Avatar answered Dec 08 '25 21:12

CHP


This now works as intended since 1.9.4. From NEWS:

Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, 'j', or in .SDcols, then just those columns are either returned (or deleted if you provide -.SDcols or !j). If instead, column names are given and there are more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008. Note that using by= to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.

Basically, if you do:

dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
#     a  a
# 1:  4  4
# 2: 10 10
# 3:  6  6
# 4: 12 12

It'll still give the unintended result, as it's hard to tell which "a" you're mentioning each time - so choosing the first always.

But if you clearly specify (as you do in your Q):

dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
#     a  a
# 1:  4  8
# 2: 10  6
# 3:  6 12
# 4: 12  6
like image 41
Arun Avatar answered Dec 08 '25 22:12

Arun



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!