
Sampling by group without repetition using data.table

Tags: r, data.table

I'll use a hypothetical scenario to illustrate the question. Here's a table of musicians and the instruments they play, and a table with the composition of a band:

library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)

band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)

To avoid arguments about who is best with which instrument, the band will be assembled by sortition. Here's how I'm doing it:

musicians[band.comp, on = 'instrument'][, sample(musician, n), by = instrument]

   instrument     V1
1:       bass   Paul
2:       bass   Chas
3:      drums   Andy
4:     guitar   Paul
5:     guitar George

The problem is: since there are musicians who play more than one instrument, it can happen that one person is drawn more than once.
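To make the problem concrete, here is a minimal sketch that repeats the draw and checks whether anyone was picked twice. The seed and the names `band` and `has.repeat` are assumptions added for illustration, and `n[1]` is used because `n` is repeated on every row of a group after the join.

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(instrument = c('bass','drums','guitar'), n = c(2,1,2))

set.seed(1)  # arbitrary seed, only so the draw can be repeated

# Same draw as above, with a named result column
band <- musicians[band.comp, on = 'instrument'][
  , .(musician = sample(musician, n[1])), by = instrument]

# TRUE when at least one musician was drawn for more than one instrument
has.repeat <- anyDuplicated(band$musician) > 0

# Rows for musicians who appear more than once (empty if there are none)
band[, if (.N > 1) .SD, by = musician]
```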

One can build a for loop that, for each instrument in turn, draws musicians and then removes them from the rest of the table. But I would like suggestions on how to do this using data.table, mainly because the real-life problem I need to solve with this logic involves datasets with hundreds of thousands of rows, and also because I'm trying to better understand the data.table syntax.
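For reference, the loop described above could be sketched roughly as follows. The function name `draw_band_loop` and the variables `pool`, `picks` and `candidates` are hypothetical names chosen for this illustration, not part of the original code.

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(instrument = c('bass','drums','guitar'), n = c(2,1,2))

draw_band_loop <- function(musicians, band.comp) {
  pool <- musicians                       # musicians still available
  picks <- vector('list', nrow(band.comp))
  for (i in seq_len(nrow(band.comp))) {
    inst <- band.comp$instrument[i]
    k <- band.comp$n[i]
    candidates <- pool[instrument == inst, musician]
    if (length(candidates) < k) stop('not enough musicians left for ', inst)
    # Draw positions rather than values, then index the candidates
    chosen <- candidates[sample.int(length(candidates), k)]
    picks[[i]] <- data.table(instrument = inst, musician = chosen)
    # Remove the chosen people from every instrument they play
    pool <- pool[!musician %in% chosen]
  }
  rbindlist(picks)
}

# The draw can dead-end, so catch the error instead of letting it propagate
res <- tryCatch(draw_band_loop(musicians, band.comp), error = function(e) NULL)
```

The loop runs once per instrument, not once per row. Because it draws greedily in the order of `band.comp`, it can fail even when a valid band exists: if John and Paul are drawn for bass and Ringo for drums, George is the only guitarist left.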

For reference, I tried some tips from Andrew Brooks's blog, but couldn't come up with a solution.

asked Jan 03 '23 01:01 by Carlos Eduardo Lagosta


1 Answer

This can be a solution: first assign each musician a single, randomly chosen instrument, then draw the musicians for each instrument. Since every musician now appears only once, nobody can be drawn twice. However, after this assignment an instrument may be left with fewer musicians than the band requires, and sample() will then throw an error because the sample size is larger than the population (in your real data this may not be a problem).

musicians[, .(instrument = sample(instrument, 1)), by = musician][band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
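One way to work around that error, sketched here under assumptions, is to repeat the random instrument assignment until every instrument has enough musicians. The wrapper `draw_band` and its `max.tries` argument are hypothetical names introduced for this example. The final draw uses `n[1]` because `n` is repeated on every row of a group after the join.

```r
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(instrument = c('bass','drums','guitar'), n = c(2,1,2))

draw_band <- function(musicians, band.comp, max.tries = 100) {
  for (i in seq_len(max.tries)) {
    # One randomly chosen instrument per musician, so nobody can be drawn twice
    assigned <- musicians[, .(instrument = sample(instrument, 1)), by = musician]
    # Musicians available per required instrument (N is NA if there are none)
    counts <- assigned[, .N, by = instrument][band.comp, on = 'instrument']
    if (all(!is.na(counts$N) & counts$N >= counts$n)) {
      return(assigned[band.comp, on = 'instrument'][
        , .(musician = sample(musician, n[1])), by = instrument])
    }
  }
  stop('no feasible assignment found in ', max.tries, ' tries')
}

band <- draw_band(musicians, band.comp)
```

With the example data, roughly one assignment in six leaves George as the only guitarist, so a few retries are normally enough. When an instrument has far fewer candidates than required, the retry limit may be reached instead.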

answered Jan 16 '23 04:01 by Juan Antonio Roldán Díaz