Sample random rows within each group in a data.table

Tags: r, data.table

How would you use data.table to efficiently take a sample of rows within each group in a data frame?

DT = data.table(a = sample(1:2), b = sample(1:1000,20))
DT
    a   b
 1: 2 562
 2: 1 183
 3: 2 180
 4: 1 874
 5: 2 533
 6: 1  21
 7: 2  57
 8: 1  20
 9: 2  39
10: 1 948
11: 2 799
12: 1 893
13: 2 993
14: 1  69
15: 2 906
16: 1 347
17: 2 969
18: 1 130
19: 2 118
20: 1 732

I was thinking of something like: DT[ , sample(??, 3), by = a] that would return a sample of three rows for each "a" (the order of the returned rows isn't significant):

    a   b
 1: 2 180
 2: 2  57
 3: 2 799
 4: 1  69
 5: 1 347
 6: 1 732
asked Apr 29 '13 by Christopher Manning


3 Answers

Maybe something like this?

> DT[, .SD[sample(.N, min(3, .N))], by = a]
   a   b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343

(Thanks to Josh for the correction, below.)
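As a small aside (not part of the original answer), the same .SD idiom can also draw a proportion of each group rather than a fixed count, for example roughly half of the rows in each group:

# sample about 50% of the rows in each group, keeping at least one row
DT[, .SD[sample(.N, max(1, floor(.N * 0.5)))], by = a]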

answered by joran

joran's answer can be generalized a bit further. The details are in "How do you sample groups in a data.table with a caveat", but the key point is handling groups that have fewer than 3 rows to sample from.

Sampling x rows from a group that contains fewer than x rows would otherwise throw an error; the min(.N, x) guard avoids that. In the example below, x = 3. (Solution by nrussell.)

set.seed(123)
##
DT <- data.table(
  a=c(1,1,1,1:15,1,1), 
  b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
     a   b
 1:  1 288
 2:  1 881
 3:  1 409
 4:  2 937
 5:  3  46
 6:  4 525
 7:  5 887
 8:  6 548
 9:  7 453
10:  8 948
11:  9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15  42
answered by road_to_quantdom


There are two subtle considerations that impact the answer to this question, and these are mentioned by Josh O'Brien and Valentin in comments. The first is that subsetting via .SD is very inefficient, and it is better to sample .I directly (see the benchmark below).

The second consideration, if we do sample from .I, is that calling sample(.I, size = 1) behaves unexpectedly whenever a group contains a single row whose index is greater than 1: because .I is then a single number, sample() acts as if we had called sample(1:.I, size = 1), which is not what we want. As Valentin notes, it is better to use the construct .I[sample(.N, size = 1)] in this case.
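To make the pitfall concrete, here is a minimal sketch (the table dt and its columns are made up for illustration): group "b" has a single row at index 3, so sample(.I, 1) draws from 1:3, whereas .I[sample(.N, 1)] always returns an index belonging to the group.

library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3)

# risky: for group "b", .I is the single number 3, so sample(.I, 1) samples from 1:3
dt[, sample(.I, size = 1), by = g]

# safe: sample positions 1:.N within the group, then index into .I
dt[, .I[sample(.N, size = 1)], by = g]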

As a benchmark, we build a simple 1,000 x 1 data.table and sample randomly per group. Even with such a small data.table the .I method is roughly 20x faster.

library(microbenchmark)
library(data.table)

set.seed(1L)
DT <- data.table(id = sample(1e3, 1e3, replace = TRUE))

microbenchmark(
  `.I` = DT[DT[, .I[sample(.N, 1)], by = id][[2]]],
  `.SD` = DT[, .SD[sample(.N, 1)], by = id]
)
#> Unit: milliseconds
#>  expr       min        lq     mean    median        uq       max neval
#>    .I  2.396166  2.588275  3.22504  2.794152  3.118135  19.73236   100
#>   .SD 55.798177 59.152000 63.72131 61.213650 64.205399 102.26781   100

Created on 2020-12-02 by the reprex package (v0.3.0)
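For completeness, here is a sketch (my own, not part of the answer) of the faster .I-based pattern combined with the min(.N, 3) guard from the earlier answers, applied to the question's original DT (grouping column a). Extracting the index column with $V1 is equivalent to the [[2]] used in the benchmark above.

# compute sampled row indices per group, then subset the table once
idx <- DT[, .I[sample(.N, min(.N, 3))], by = a]$V1
DT[idx]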

answered by tomshafer