Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: select range of columns in data.table

I am trying to get to grips with the documentation of data.table, but I would like to ask for feedback, where I am going wrong in my reasoning concerning the following.

(1) I would like to select a range of columns from a data.table to create a new data.table.

(2) Also, I would like to just take the first value of each group. With regard to the first question I think the answer is given here ,but then with regard to column numbers. But I would like to use column names, which I think is one of the main strengths (and selling points) of data.table.

Here's an example dataset.

DT <- data.table(ID=c(101,101,101,102,103,104,104),
                 "year.1" = c(1,5,3,2,3,4,8), 
                 "year.2" = c(4,5,6,NA,1,2,3), 
                 "year.3" = c(1,2,3,7,9,8,0), 
                 "year.4" = c(4,5,NA,1,2,6,9))
setkey(DT,ID)

In reality I have much more columns and not just for "year".

# ALL OF THESE DONT WORK AND END IN ERRORS
# To extract a range of columns I have tried this:
dt.sub <- DT[,list(year.1:year.3,ID)]
dt.sub <- DT[,c("year.1":"year.3",ID), with=FALSE] # I know shouldn't work since 
# "with=FALSE" is only intended in combination with := according to the documentation
dt.sub <- DT[,lapply(SD),.SDcols= for (i in 1:3) paste0("year.",i) ]

For the second question: if I wanted dt.sub to contain only the first observation of each group, I would expect that I could use the "mult" argument. However this also works in a different way than I anticipate. Using the example on just one column:

dt.sub1 <- DT[,year.1, by=ID,mult="first",]

This provides no error, but also does not just give the first line of the group. I know that a workaround like:

dt.sub1 <- unique(DT[,year.1, by=ID])

does provide the expected output, but I feel like I am missing something important with the mult option.

like image 451
Richard Avatar asked Mar 24 '23 03:03

Richard


1 Answers

# (1)
DT[, c(paste0('year.', 1:3), 'ID'), with = F]

# (2) 
DT[, year.1[1], by = ID]

mult is used when merging/joining two data.tables and signifies what to do when multiple matches exist. Therefore, as @Arun pointed out, the way to use mult for your 2nd question would be (given that you are already keyed by ID):

DT[J(unique(ID)), list(ID, year.1), mult = 'first']
like image 131
eddi Avatar answered Apr 05 '23 20:04

eddi