Sample random rows within each group in a data.table

Tags: r, data.table

How would you use data.table to efficiently take a sample of rows within each group in a data frame?

DT = data.table(a = sample(1:2), b = sample(1:1000,20))
DT
    a   b
 1: 2 562
 2: 1 183
 3: 2 180
 4: 1 874
 5: 2 533
 6: 1  21
 7: 2  57
 8: 1  20
 9: 2  39
10: 1 948
11: 2 799
12: 1 893
13: 2 993
14: 1  69
15: 2 906
16: 1 347
17: 2 969
18: 1 130
19: 2 118
20: 1 732

I was thinking of something like: DT[ , sample(??, 3), by = a] that would return a sample of three rows for each "a" (the order of the returned rows isn't significant):

    a   b
 1: 2 180
 2: 2  57
 3: 2 799
 4: 1  69
 5: 1 347
 6: 1 732
asked Apr 29 '13 by Christopher Manning


3 Answers

Maybe something like this?

> DT[, .SD[sample(.N, min(3, .N))], by = a]
   a   b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343

(Thanks to Josh for the correction, below.)
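As a small aside (not part of the original answer), the same .SD idiom can also draw a proportion of each group rather than a fixed count, for example roughly half of the rows in each group:

# sample about 50% of the rows in each group, keeping at least one row
DT[, .SD[sample(.N, max(1, floor(.N * 0.5)))], by = a]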

answered by joran

joran's answer can be generalized a bit further. The details are in "How do you sample groups in a data.table with a caveat", but the key point is handling groups that have fewer than 3 rows to sample from.

Sampling x rows from a group that contains fewer than x rows would otherwise throw an error; the min(.N, x) guard avoids that. In the example below, x = 3. (Solution by nrussell.)

set.seed(123)
##
DT <- data.table(
  a=c(1,1,1,1:15,1,1), 
  b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
     a   b
 1:  1 288
 2:  1 881
 3:  1 409
 4:  2 937
 5:  3  46
 6:  4 525
 7:  5 887
 8:  6 548
 9:  7 453
10:  8 948
11:  9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15  42
answered by road_to_quantdom


There are two subtle considerations that impact the answer to this question, and these are mentioned by Josh O'Brien and Valentin in comments. The first is that subsetting via .SD is very inefficient, and it is better to sample .I directly (see the benchmark below).

The second consideration, if we do sample from .I, is that calling sample(.I, size = 1) behaves unexpectedly whenever a group contains a single row whose index is greater than 1: because .I is then a single number, sample() acts as if we had called sample(1:.I, size = 1), which is not what we want. As Valentin notes, it is better to use the construct .I[sample(.N, size = 1)] in this case.
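To make the pitfall concrete, here is a minimal sketch (the table dt and its columns are made up for illustration): group "b" has a single row at index 3, so sample(.I, 1) draws from 1:3, whereas .I[sample(.N, 1)] always returns an index belonging to the group.

library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3)

# risky: for group "b", .I is the single number 3, so sample(.I, 1) samples from 1:3
dt[, sample(.I, size = 1), by = g]

# safe: sample positions 1:.N within the group, then index into .I
dt[, .I[sample(.N, size = 1)], by = g]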

As a benchmark, we build a simple 1,000 x 1 data.table and sample randomly per group. Even with such a small data.table the .I method is roughly 20x faster.

library(microbenchmark)
library(data.table)

set.seed(1L)
DT <- data.table(id = sample(1e3, 1e3, replace = TRUE))

microbenchmark(
  `.I` = DT[DT[, .I[sample(.N, 1)], by = id][[2]]],
  `.SD` = DT[, .SD[sample(.N, 1)], by = id]
)
#> Unit: milliseconds
#>  expr       min        lq     mean    median        uq       max neval
#>    .I  2.396166  2.588275  3.22504  2.794152  3.118135  19.73236   100
#>   .SD 55.798177 59.152000 63.72131 61.213650 64.205399 102.26781   100

Created on 2020-12-02 by the reprex package (v0.3.0)
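For completeness, here is a sketch (my own, not part of the answer) of the faster .I-based pattern combined with the min(.N, 3) guard from the earlier answers, applied to the question's original DT (grouping column a). Extracting the index column with $V1 is equivalent to the [[2]] used in the benchmark above.

# compute sampled row indices per group, then subset the table once
idx <- DT[, .I[sample(.N, min(.N, 3))], by = a]$V1
DT[idx]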

answered by tomshafer