dplyr + "meta"-columns: when a column contains names of other columns to use instead of the data

Tags:

r

dplyr

I wonder if the following question has an elegant solution in dplyr.

To provide a simple reproducible example, consider the following data.frame:

df <- data.frame( a=1:5, b=2:6, c=3:7,
                  ref=c("a","a","b","b","c"), 
                  stringsAsFactors = FALSE )

Here a, b, and c are regular numeric variables, while ref names the column that holds the "main" value for that observation:

  a b c ref
1 1 2 3   a
2 2 3 4   a
3 3 4 5   b
4 4 5 6   b
5 5 6 7   c

For example, for observation 3, ref == "b", so column b contains the main value, while for observation 1, ref == "a", so column a contains it.

Given this data.frame, the question is how to create a new column, main, holding the main value for each observation, using dplyr:

  a b c ref main
1 1 2 3   a    1
2 2 3 4   a    2
3 3 4 5   b    4
4 4 5 6   b    5
5 5 6 7   c    7

I'll probably need to use dplyr for that since this one operation is a part of a longer dplyr %>% data transformation chain.

akhmed asked Dec 09 '22 03:12


1 Answer

Here's a simple, fast way that allows you to stick with dplyr chaining:

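library(dplyr)       # provides the %>% pipe
library(data.table)
# by = ref makes ref a single column name within each group, so get(ref) returns that column
df %>% setDT %>% .[, main := get(ref), by = ref]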
#    a b c ref main
# 1: 1 2 3   a    1
# 2: 2 3 4   a    2
# 3: 3 4 5   b    4
# 4: 4 5 6   b    5
# 5: 5 6 7   c    7

Thanks to @akrun for the idea for the fastest way and benchmarking to show it (see his answer).

setDT converts df to a data.table by reference (in place, without copying), so you won't have to convert to data.table again in future chains.
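
To make the in-place conversion concrete, here is a minimal sketch that calls setDT directly (outside a pipe) on the question's df and checks the class before and after. It assumes the data.table package is installed; nothing else is introduced beyond the names already used above.

```r
library(data.table)

# The example data from the question
df <- data.frame(a = 1:5, b = 2:6, c = 3:7,
                 ref = c("a", "a", "b", "b", "c"),
                 stringsAsFactors = FALSE)

class_before <- class(df)   # a plain data.frame

# setDT() changes df itself; no assignment is needed
setDT(df)
class_after <- class(df)    # now a data.table (which is still also a data.frame)

# Within each ref group, ref is a single column name, so get(ref)
# returns that column's values for the rows of the group
df[, main := get(ref), by = ref]
print(df)
```

Because df is still a data.frame as far as class inheritance goes, it can generally be passed on to functions that expect a data.frame.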


The conversion should work with any later code in the chain, but both dplyr and data.table are under active development, so to be on the safe side one could instead convert explicitly:

df %>% data.table %>% .[,main:=get(ref),by=ref]
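
If adding data.table to the chain is not an option, the same lookup can be sketched with base R matrix indexing wrapped in a small function that drops into a %>% chain. Note that add_main is a hypothetical helper written for this illustration, not part of dplyr or data.table, and it assumes that every column other than ref is a candidate value column of the same type.

```r
library(dplyr)   # used here only for the %>% pipe

df <- data.frame(a = 1:5, b = 2:6, c = 3:7,
                 ref = c("a", "a", "b", "b", "c"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: adds a "main" column chosen row by row
# according to the column name stored in ref_col.
# Assumption: all columns except ref_col are value columns of one type.
add_main <- function(d, ref_col = "ref") {
  value_cols <- setdiff(names(d), ref_col)
  m <- as.matrix(d[value_cols])
  # One (row, column) index pair per observation
  idx <- cbind(seq_len(nrow(d)), match(d[[ref_col]], value_cols))
  d$main <- m[idx]
  d
}

res <- df %>% add_main()
print(res)
```

This stays vectorized (no per-row loop), but unlike the get(ref) approach it relies on the value columns sharing a type, since they pass through a matrix.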
Frank answered Dec 10 '22 16:12