
R - Create a new variable where each observation depends on another table and other variables in the data frame

Tags: r, data.table

I have the following two tables:

df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))

df:

eth    ZIP1
A      1
B      1
B      2
A      3
C      5

Inc:

ZIP2    A    B    C
1      56   49   69
2      98   10    2
3      43   69   59
4       4   30    8
5      90   10   17
6      19    4   84
7      59   95   30

I would like to create a variable Inc in the df data frame where, for each observation, the value is the entry of the Inc table at the row matching the observation's ZIP1 and the column matching its eth. In my example, it would lead to:

   eth    ZIP1   Inc        
    A      1    56
    B      1    49
    B      2    10
    A      3    43
    C      5    17

A loop or some other brute-force approach would solve it, but it takes too long on my dataset, so I'm looking for a more elegant way, maybe using data.table. This seems like a very standard question, and I apologize if it is a duplicate: my inability to formulate a precise title for this problem (as you may have noticed) is probably why I haven't found a similar question when searching the forum.

Thanks !

Yurienu asked Nov 13 '15 23:11



2 Answers

Sure, it can be done in data.table:

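library(data.table)
setDT(df)   # convert df to a data.table by reference
setDT(Inc)  # data.table's melt() needs a data.table, not a data.frame

# reshape Inc to long form, join it onto df, and assign Inc by reference
df[ melt(Inc, id.vars="ZIP2", variable.name="eth", value.name="Inc"), 
  Inc := i.Inc
, on=c(ZIP1 = "ZIP2","eth") ]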

The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].

You can run the i = melt(Inc, id.vars="ZIP2", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
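As a rough illustration of the two steps described above, the sketch below builds the long table first and then does the merge-assign. The name `long` and the use of `variable.factor = FALSE` (to keep eth as character) are choices made for this example, not part of the original answer.

```r
library(data.table)

df <- data.table(eth = c("A", "B", "B", "A", "C"), ZIP1 = c(1, 1, 2, 3, 5))
Inc <- data.table(ZIP2 = c(1, 2, 3, 4, 5, 6, 7),
                  A = c(56, 98, 43, 4, 90, 19, 59),
                  B = c(49, 10, 69, 30, 10, 4, 95),
                  C = c(69, 2, 59, 8, 17, 84, 30))

# Step 1: reshape Inc from wide to long, one row per (ZIP2, eth) pair.
# variable.factor = FALSE keeps eth as character, matching df$eth.
long <- melt(Inc, id.vars = "ZIP2", variable.name = "eth",
             value.name = "Inc", variable.factor = FALSE)
print(head(long, 3))

# Step 2: join long onto df and copy the matched value into df by reference.
# i.Inc refers to the Inc column of the table passed as i (here, long).
df[long, Inc := i.Inc, on = c(ZIP1 = "ZIP2", "eth")]
print(df$Inc)  # 56 49 10 43 17
```

Rows of `long` with no matching (ZIP1, eth) pair in df are simply ignored by the assignment.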


Alternately...

setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]

This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.
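To see what the grouped version does for a single group, here is a minimal sketch for eth == "B" only. The variable names `zips_b` and `b_values` are made up for this example, and the column is extracted with `[["B"]]` instead of `with=FALSE` to keep the sketch simple.

```r
library(data.table)

Inc <- data.table(ZIP2 = c(1, 2, 3, 4, 5, 6, 7),
                  A = c(56, 98, 43, 4, 90, 19, 59),
                  B = c(49, 10, 69, 30, 10, 4, 95),
                  C = c(69, 2, 59, 8, 17, 84, 30))

# In df, the rows with eth == "B" have ZIP1 values 1 and 2.
zips_b <- c(1, 2)

# Look those ZIP codes up in Inc via a join on ZIP2,
# then pull out the column named after the group.
b_values <- Inc[.(zips_b), on = "ZIP2"][["B"]]
print(b_values)  # 49 10
```

The `by=eth` in the answer repeats this lookup once per distinct eth value, picking the matching column each time.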

Frank answered Sep 28 '22 04:09


We can use row/column indexing with a two-column matrix of (row, column) positions:

df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]

df
#  eth ZIP1 Inc
#1   A    1  56
#2   B    1  49
#3   B    2  10
#4   A    3  43
#5   C    5  17
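The one-liner above can be unpacked into its intermediate steps. This is only a sketch in base R; the names `rows`, `cols` and `idx` are introduced here for illustration.

```r
df <- data.frame(eth = c("A", "B", "B", "A", "C"), ZIP1 = c(1, 1, 2, 3, 5))
Inc <- data.frame(ZIP2 = c(1, 2, 3, 4, 5, 6, 7),
                  A = c(56, 98, 43, 4, 90, 19, 59),
                  B = c(49, 10, 69, 30, 10, 4, 95),
                  C = c(69, 2, 59, 8, 17, 84, 30))

# Row position in Inc for each observation's ZIP code.
rows <- match(df$ZIP1, Inc$ZIP2)        # 1 1 2 3 5

# Column position in Inc for each observation's eth.
cols <- match(df$eth, colnames(Inc))    # 2 3 3 2 4

# A two-column matrix: each row is one (row, column) pair to extract.
idx <- cbind(rows, cols)

# Indexing with that matrix returns one value per pair.
df$Inc <- Inc[idx]
print(df$Inc)                           # 56 49 10 43 17
```

An observation whose ZIP1 or eth has no counterpart in Inc gets NA from `match()`, and therefore NA in the result.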
akrun answered Sep 28 '22 02:09