Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match by id and divide column values across two dataframes

Tags:

dataframe

r

dplyr

frames:

df 1: contains multiple rows of the same id with 500 column values

    id|val.1|val.2|...|val.500
---------------------------------
    1 | 240 | 234 |...|228
    1 | 224 | 222 |...|230
    1 | 238 | 240 |...|240
    2 | 277 | 270 |...|255
    2 | 291 | 290 |...|265
    2 | 284 | 282 |...|285

df 2: contains only one unique id (row) that matches df-1 id column with 500 column values

    id|val.1|val.2|...|val.500
---------------------------------
    1 | 250 | 240 |...|245
    2 | 280 | 282 |...|281

I would like to divide df 1 column values by the corresponding column value in df 2 based on their id to end up with a df 3:

    id|val.1|val.2|...|val.500
---------------------------------
    1 | 0.96| 0.98|...|0.93
    1 | 0.90| 0.93|...|0.94
    1 | 0.95| 1.00|...|0.98
    2 | 0.99| 0.96|...|0.91
    2 | 1.04| 1.03|...|0.94
    2 | 1.01| 1.00|...|1.01

Basically weighting df 1 values by df 2 based on their id and column value. I've been scratching my head for a while now about the best way to go about this, and not making much progress. any guidance would be greatly appreciated. Thanks

like image 674
Anand Roopsind Avatar asked Jan 03 '17 20:01

Anand Roopsind


2 Answers

Two possible approaches:

1: 'wide'-approach

With the dplyr and purrr packages:

library(dplyr)
library(purrr)

df12 <- left_join(df1, df2, by = 'id')
cbind(id=df12[,1], map2_df(df12[,2:4], df12[,5:7], `/`))

With the data.table package (method borrowed from here):

library(data.table)

# convert to 'data.tables'
setDT(df1)
setDT(df2)

# creates two vectors of matching columnnames
xcols = names(df1)[-1]
icols = paste0("i.", xcols)

# join and do the calculation
df1[df2, on = 'id', Map('/', mget(xcols), mget(icols)), by = .EACHI]

which both give:

   id     val.1     val.2     val.3
1:  1 0.9600000 0.9750000 0.9306122
2:  1 0.8960000 0.9250000 0.9387755
3:  1 0.9520000 1.0000000 0.9795918
4:  2 0.9892857 0.9574468 0.9074733
5:  2 1.0392857 1.0283688 0.9430605
6:  2 1.0142857 1.0000000 1.0142349

2: 'long'-approach

Another option is to reshape your dataframes into long format, then merge/join them and do the calculation.

With the data.table-package:

library(data.table)

dt1 <- melt(setDT(df1), id = 1)
dt2 <- melt(setDT(df2), id = 1)

dt1[dt2, on = c('id','variable'), value := value/i.value][]

With the dplyr and tidyr packages:

library(dplyr)
library(tidyr)

df1 %>% 
  gather(variable, value, -id) %>% 
  left_join(., df2 %>% gather(variable, value, -id), by = c('id','variable')) %>% 
  mutate(value = value.x/value.y) %>% 
  select(id, variable, value)

which both give:

    id variable     value
 1:  1    val.1 0.9600000
 2:  1    val.1 0.8960000
 3:  1    val.1 0.9520000
 4:  2    val.1 0.9892857
 5:  2    val.1 1.0392857
 6:  2    val.1 1.0142857
 7:  1    val.2 0.9750000
 8:  1    val.2 0.9250000
 9:  1    val.2 1.0000000
10:  2    val.2 0.9574468
11:  2    val.2 1.0283688
12:  2    val.2 1.0000000
13:  1    val.3 0.9306122
14:  1    val.3 0.9387755
15:  1    val.3 0.9795918
16:  2    val.3 0.9074733
17:  2    val.3 0.9430605
18:  2    val.3 1.0142349

Used data:

df1 <- structure(list(id = c(1, 1, 1, 2, 2, 2), val.1 = c(240, 224, 238, 277, 291, 284), 
                      val.2 = c(234, 222, 240, 270, 290, 282), val.3 = c(228, 230, 240, 255, 265, 285)), 
                 .Names = c("id", "val.1", "val.2", "val.3"), class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(id = c(1, 2), val.1 = c(250, 280), val.2 = c(240, 282), val.3 = c(245, 281)),
                 .Names = c("id", "val.1", "val.2", "val.3"), class = "data.frame", row.names = c(NA, -2L))
like image 59
Jaap Avatar answered Oct 24 '22 00:10

Jaap


As long as the data.frames are ordered properly by column and both have the same columns, then I think the following base R code will accomplish what you want.

cbind(df1[1], df1[-1] / df2[match(df1$id, df2$id), -1])

  id     val.1     val.2   val.500
1  1 0.9600000 0.9750000 0.9306122
2  1 0.8960000 0.9250000 0.9387755
3  1 0.9520000 1.0000000 0.9795918
4  2 0.9892857 0.9574468 0.9074733
5  2 1.0392857 1.0283688 0.9430605
6  2 1.0142857 1.0000000 1.0142349

Here, match(df1$id, df2$id) will return the row indices of df1 that correspond to the ids of df2, so df2[match(df1$id, df2$id), -1] will return the corresponding rows of df2 as a data.frame with the id variable removed. This data.frame then matches df1 in shape when the id variable is removed and df1[-1] / df2[match(df1$id, df2$id), -1] performs the division. Finally cbind prepends the id variable to the final data.frame.

data

df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), val.1 = c(240L, 
224L, 238L, 277L, 291L, 284L), val.2 = c(234L, 222L, 240L, 270L, 
290L, 282L), val.500 = c(228L, 230L, 240L, 255L, 265L, 285L)), .Names = c("id", 
"val.1", "val.2", "val.500"), class = "data.frame", row.names = c(NA, 
-6L))

df2 <- structure(list(id = 1:2, val.1 = c(250L, 280L), val.2 = c(240L, 
282L), val.500 = c(245L, 281L)), .Names = c("id", "val.1", "val.2", 
"val.500"), class = "data.frame", row.names = c(NA, -2L))
like image 40
lmo Avatar answered Oct 24 '22 01:10

lmo