Select values from different columns based on a variable containing column names [duplicate]

Tags: r, data.table

I have a data.table like this:

col1   col2   col3  new  
1       4     55    col1 
2       3     44    col2
3       34    35    col2
4       44    87    col3

I want to add another column, matched_value, that holds for each row the value from the column named in that row's new column:

col1   col2   col3  new    matched_value
1       4     55    col1        1
2       3     44    col2        3
3       34    35    col2        34
4       44    87    col3        87 

E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.

How can I do this efficiently in R on a very large data.table?

user3664020 asked Oct 23 '15 19:10


2 Answers

An excuse to use the obscure .BY:

DT[, newval := .SD[[.BY[[1]]]], by=new]

   col1 col2 col3  new newval
1:    1    4   55 col1      1
2:    2    3   44 col2      3
3:    3   34   35 col2     34
4:    4   44   87 col3     87

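How it works. by=new splits the data into groups, one per distinct string in new. Within each group, .BY is a list holding the group's value for each grouping variable, so newname = .BY[[1]] is that group's column name as a string. We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data; by default it holds the current group's rows for every column except the grouping ones.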

Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
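To try both variants end to end, here is a minimal sketch that rebuilds the question's table. It assumes the data.table package is installed; the value columns are stored as doubles so that every group assigns the same type, and via_get is a column name made up here for the comparison.

```r
library(data.table)

# Rebuild the example table from the question (all value columns are doubles).
DT <- data.table(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3")
)

# Group by `new`; .BY[[1]] is the current group's column name as a string.
DT[, newval := .SD[[.BY[[1]]]], by = new]

# Same lookup using get(); `via_get` is a hypothetical column name.
DT[, via_get := get(.BY[[1]]), by = new]

print(DT)
```

Both new columns should hold 1, 3, 34, 87 for this input. If the candidate columns have different types (say, numeric and character), the grouped assignment may not be able to store them in a single column, so converting them to a common type first is the safer route.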

Frank answered Nov 12 '22 07:11


We can match the 'new' column against the column names of the dataset to get a column index, cbind it with the row index (1:nrow(df1)), and use the resulting two-column matrix to extract the corresponding elements of the dataset. The result can be assigned to a new column.

df1$matched_value <- df1[-4][cbind(1:nrow(df1),match(df1$new, colnames(df1) ))]
df1
#  col1 col2 col3  new matched_value
#1    1    4   55 col1             1
#2    2    3   44 col2             3
#3    3   34   35 col2            34
#4    4   44   87 col3            87

NOTE: If the OP has a data.table, one option is to convert it to a data.frame, or to use with=FALSE while subsetting.

 setDF(df1) #to convert to 'data.frame'.
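For a table that should stay a data.table, the same row/column matrix lookup can be sketched as below. This is one possible reading of the with=FALSE suggestion, not necessarily what the answer had in mind. It assumes the data.table package is installed and that all candidate columns are numeric, so that as.matrix() yields a numeric matrix; value_cols, m and idx are names introduced here for illustration.

```r
library(data.table)

DT <- data.table(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3")
)

# Columns that `new` may point to (hypothetical helper variable).
value_cols <- setdiff(names(DT), "new")

# with = FALSE makes data.table treat value_cols as a vector of column names.
m <- as.matrix(DT[, value_cols, with = FALSE])

# Two-column index matrix: row number first, matched column number second.
idx <- cbind(seq_len(nrow(DT)), match(DT$new, value_cols))

# Indexing a matrix by a two-column matrix picks one element per row of idx.
DT[, matched_value := m[idx]]
print(DT)
```

Matching against value_cols rather than all column names keeps the column positions aligned with the matrix even if the name column is not the last one.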

Benchmarks

set.seed(45)
df2 <- data.frame(col1 = sample(1:9, 20e6, replace = TRUE),
                  col2 = sample(1:20, 20e6, replace = TRUE),
                  col3 = sample(1:40, 20e6, replace = TRUE),
                  col4 = sample(1:30, 20e6, replace = TRUE),
                  new = sample(paste0('col', 1:4), 20e6, replace = TRUE),
                  stringsAsFactors = FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2), match(df2$new, colnames(df2)))])
#   user  system elapsed 
#  2.54    0.37    2.92 
akrun answered Nov 12 '22 06:11