Join data frames and select random row when there are multiple matches

Question

I have a reference data frame (df1) with three columns of "characteristics" (gender, year, code), and two columns of "values" (amount, status). It looks like this, but with many rows:

gender    year    code    amount   status
     M    2011       A        15      EMX
     M    2011       A       123      NOX
     F    2015       B         0      MIX
     F    2018       A        12      NOX
     F    2015       B        11      NOX

I have another data frame (df2) that just has the three "characteristics" columns. For example:

gender    year   code
     M    2011      A
     M    2011      A
     F    2018      A
     F    2015      B

For each row in df2, I want to assign "values" based on matches in "characteristics" to df1. Where there are multiple matches, I want to select pairs of "values" at random. So when there are duplicate "characteristics" in df2, they might end up with different pairs of "values", but all of them will have an exact match in df1. Essentially, for each combination of characteristics, I want the distribution of values to match between the two tables.

For example, the last row in 'df2' (gender = F, year = 2015, code = B) matches two rows in 'df1': the third row (amount = 0, status = MIX) and the fifth row (amount = 11, status = NOX). One of these matching rows should then be selected at random. The same applies to every case of multiple matches between 'df2' and 'df1' based on gender, year and code.

So far, my approach has been to start by using dplyr to do a left_join between the two data frames. However, this provides all possible "values" for each row in df2, rather than selecting one at random. So I then have to group by characteristics and select one. This produces a very large intermediate table and doesn't seem very efficient.
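A minimal sketch of the join-then-sample approach just described might look like the following. The example data is typed in by hand, row_id is an illustrative helper column (not from the original code), and slice_sample() assumes dplyr 1.0.0 or later (older versions would use sample_n()). Recent dplyr versions may also warn about a many-to-many join here, which is expected for this data.

```r
library(dplyr)

df1 <- data.frame(
  gender = c("M", "M", "F", "F", "F"),
  year   = c(2011, 2011, 2015, 2018, 2015),
  code   = c("A", "A", "B", "A", "B"),
  amount = c(15, 123, 0, 12, 11),
  status = c("EMX", "NOX", "MIX", "NOX", "NOX"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  gender = c("M", "M", "F", "F"),
  year   = c(2011, 2011, 2018, 2015),
  code   = c("A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

set.seed(1)
result <- df2 %>%
  mutate(row_id = row_number()) %>%                  # tag each df2 row so duplicates stay distinct
  left_join(df1, by = c("gender", "year", "code")) %>% # all matches: the large intermediate table
  group_by(row_id) %>%
  slice_sample(n = 1) %>%                            # keep one random match per df2 row
  ungroup() %>%
  select(-row_id)

print(result)
```

The intermediate table has one row per (df2 row, matching df1 row) pair, which is where the memory cost comes from when keys are heavily duplicated.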

I wondered if anyone had suggestions for a more efficient method? I've previously found that joining with the data.table package is quicker, but don't really have a good understanding of the package. I also wonder if I should be doing joins at all, or should just use the sample function?

Any help much appreciated.

Henrik · Accepted Answer

Assuming 'd1' and 'd2' are data.table versions of df1 and df2, use 'd2' to look up rows in 'd1' based on matches in 'gender', 'year' and 'code' (d1[d2, on = .(gender, year, code), ...]). For each row of 'd2' (by = .EACHI), sample one row index among its matches (sample(.N, 1L)), and use that index to pick 'amount' and 'status'.

d1[d2, on = .(gender, year, code),
  {ri <- sample(.N, 1L)
  .(amount = amount[ri], status = status[ri])}, by = .EACHI]

# sample based on set.seed(1)
#    gender year code amount status
# 1:      M 2011    A     15    EMX
# 2:      M 2011    A     15    EMX
# 3:      F 2018    A     12    NOX
# 4:      F 2015    B     11    NOX
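The snippet above assumes that d1 and d2 already exist as data.tables. A self-contained version, with the example data typed in by hand, might look like the sketch below. The sampled rows depend on the seed and on the R version (R 3.6.0 changed the default sampling method), so the printed result may differ from the one shown above.

```r
library(data.table)

d1 <- data.table(
  gender = c("M", "M", "F", "F", "F"),
  year   = c(2011L, 2011L, 2015L, 2018L, 2015L),
  code   = c("A", "A", "B", "A", "B"),
  amount = c(15L, 123L, 0L, 12L, 11L),
  status = c("EMX", "NOX", "MIX", "NOX", "NOX")
)
d2 <- data.table(
  gender = c("M", "M", "F", "F"),
  year   = c(2011L, 2011L, 2018L, 2015L),
  code   = c("A", "A", "A", "B")
)

set.seed(1)
res <- d1[d2, on = .(gender, year, code),
  {ri <- sample(.N, 1L)                          # one random index among the matches
  .(amount = amount[ri], status = status[ri])},  # values taken from the same d1 row
  by = .EACHI]                                   # evaluated once per row of d2

print(res)
```

Because amount and status are indexed with the same ri, each result row is an exact row of d1 rather than a mix of values from different rows.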

Note that there is an open data.table issue, "Enhanced functionality of mult argument", about how to handle cases where multiple rows in x match a row in i. Currently, the valid options are "all" (default), "first" and "last". If the issue is implemented, mult = "random" (equivalent to sample(.N, size = 1L)) could be used to select a random row (or rows) among the matches.
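For comparison, the mult options that exist today are deterministic. A small sketch with the example data (typed in by hand) shows mult = "first" and mult = "last" each returning exactly one row per row of d2, chosen by position in d1 rather than at random.

```r
library(data.table)

d1 <- data.table(
  gender = c("M", "M", "F", "F", "F"),
  year   = c(2011L, 2011L, 2015L, 2018L, 2015L),
  code   = c("A", "A", "B", "A", "B"),
  amount = c(15L, 123L, 0L, 12L, 11L),
  status = c("EMX", "NOX", "MIX", "NOX", "NOX")
)
d2 <- data.table(
  gender = c("M", "M", "F", "F"),
  year   = c(2011L, 2011L, 2018L, 2015L),
  code   = c("A", "A", "A", "B")
)

# One row per d2 row: the first (or last) matching row of d1
first_match <- d1[d2, on = .(gender, year, code), mult = "first"]
last_match  <- d1[d2, on = .(gender, year, code), mult = "last"]

print(first_match)
print(last_match)
```

Duplicate keys in d2 always receive the same d1 row with these options, which is why they do not reproduce the distribution of values the question asks for.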

Lyngbakr · Answer

My data.table game is pretty weak, but here's a potential solution using an approach similar to the one you describe above. First, I define the data frames.

# Define data frames
df1 <- read.table(text= "gender    year    code    amount   status
M    2011       A        15      EMX
M    2011       A       123      NOX
F    2015       B         0      MIX
F    2018       A        12      NOX
F    2015       B        11      NOX", header = TRUE)

df2 <- read.table(text = "gender    year   code
     M    2011      A
     M    2011      A
     F    2018      A
     F    2015      B", header = TRUE)

Then, I set the random number generator seed for reproducibility and load the library.

# Set RNG seed
set.seed(4)

# Load library
library(data.table)

Next, I convert data frames to data tables.

# Convert to data tables
dt1 <- data.table(df1) 
dt2 <- data.table(df2)

Here, I do the actual joins, etc. I've done it 5 times in a loop to show the randomness of the results.

for (i in 1:5) {
  # Add row numbers to dt2, join to dt1, then keep one random match per dt2 row
  dt3 <- dt2[, rn :=.I
             ][dt1,on = .(gender, year, code)
               ][, .SD[sample(.N)[1]], .(gender, year, code, rn)
                 ][, rn := NULL]

  # Check results
  print(dt3)
}
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A     15    EMX
#> 3:      F 2015    B      0    MIX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A    123    NOX
#> 3:      F 2015    B     11    NOX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A    123    NOX
#> 3:      F 2015    B     11    NOX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A     15    EMX
#> 2:      M 2011    A     15    EMX
#> 3:      F 2015    B     11    NOX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A     15    EMX
#> 3:      F 2015    B      0    MIX
#> 4:      F 2018    A     12    NOX

Created on 2019-06-12 by the reprex package (v0.3.0)

What I actually do is add row numbers to the data table, which helps me pare down the final data table. I join the data tables, group all of the rows that originated from a single row in dt2, and pull one at random using sample. (This bit of code is borrowed from an answer by @akrun.) Finally, I drop the row number column.
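One side effect worth knowing about: rn := .I modifies dt2 by reference, so dt2 itself keeps the rn column after the loop. A variant that avoids this could copy the lookup table first and join in the other direction, so that the result follows dt2's row order. The sketch below wraps that in a hypothetical helper (sample_match is an illustrative name, not a data.table function) and types the example data in by hand. It is meant as an illustration under those assumptions, not a replacement for the code above.

```r
library(data.table)

dt1 <- data.table(
  gender = c("M", "M", "F", "F", "F"),
  year   = c(2011L, 2011L, 2015L, 2018L, 2015L),
  code   = c("A", "A", "B", "A", "B"),
  amount = c(15L, 123L, 0L, 12L, 11L),
  status = c("EMX", "NOX", "MIX", "NOX", "NOX")
)
dt2 <- data.table(
  gender = c("M", "M", "F", "F"),
  year   = c(2011L, 2011L, 2018L, 2015L),
  code   = c("A", "A", "A", "B")
)

# Hypothetical helper: one random matching row of `ref` per row of `lookup`
sample_match <- function(lookup, ref, keys) {
  lookup <- copy(lookup)[, rn := .I]     # copy, so the caller's table is left untouched
  joined <- ref[lookup, on = keys]       # one row per (lookup row, matching ref row)
  joined[, .SD[sample(.N, 1L)], by = rn][, rn := NULL][]
}

set.seed(4)
dt3 <- sample_match(dt2, dt1, c("gender", "year", "code"))
print(dt3)
```

Grouping by rn alone is enough here, since each rn already identifies a single row of dt2 and therefore a single combination of gender, year and code.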
