When I want to randomly select some samples from different groups, I use the plyr package and the code below:
require(plyr)
sampleGroup <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}
iris.sample <- ddply(iris, .(Species), function(df) sampleGroup(df, 10))
Here 10 samples are selected from each species.
Some of my data frames are very big, so my question is: can I use the same sampleGroup function with the dplyr package, or is there another way to do the same thing in dplyr?
EDIT
Version 0.2 of the dplyr package introduced two new functions for selecting random rows from a table: sample_n() and sample_frac().
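For reference, here is a minimal sketch of those two functions applied to the same grouped-sampling task (the 10-rows-per-Species and 10% figures are just illustrative):
library(dplyr)
# a fixed number of rows per group
iris %>% group_by(Species) %>% sample_n(10)
# or a fraction of each group's rows
iris %>% group_by(Species) %>% sample_frac(0.1)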
Yes, you can use dplyr:
mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 2)
and the result looks like this:
Source: local data frame [6 x 11]
Groups: cyl
mpg cyl disp hp drat wt qsec vs am gear carb
1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
3 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
4 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
5 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
6 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Historical note: slice_sample() replaced sample_n() in dplyr 1.0.0 (May 2020). Early versions of dplyr required do(sample_n(., 2)) instead.
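For completeness, a sketch of that older do() pattern on the same example (my reconstruction, not part of the original answer):
mtcars %>%
  group_by(cyl) %>%
  do(sample_n(., 2))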
This is easy to do with data.table, and it is useful for a big table.
NOTE: As mentioned in the comments by Troy, there is a more efficient way of doing this with data.table, but I wanted to respect the OP's sample function and format in this answer (see the sketch after the iris example below).
require(data.table)
DT <- data.table(x = rnorm(10e6, 100, 50), y = letters)
sampleGroup <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}
result <- DT[, sampleGroup(.SD, 10), by = y]
print(result)
# y x y
# 1: a 30.11659 m
# 2: a 57.99974 h
# 3: a 58.13634 o
# 4: a 87.28466 x
# 5: a 85.54986 j
# ---
# 256: z 149.85817 d
# 257: z 160.24293 e
# 258: z 26.63071 j
# 259: z 17.00083 t
# 260: z 130.27796 f
system.time(DT[, sampleGroup(.SD, 10), by=y])
# user system elapsed
# 0.66 0.02 0.69
Using the iris dataset:
iris <- data.table(iris)
iris[, sampleGroup(.SD, 10), by = Species]
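And, for reference, a sketch of the more efficient native data.table idiom mentioned above: sample row indices per group with .I, then subset the table once, which avoids the per-group function call (it mirrors the benchmark code further down):
# Species and the sampled row indices come back as one data.table; $val extracts the indices
iris[iris[, list(val = sample(.I, 10)), by = Species]$val]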
That's a good question! I can't see any easy way to do it with the documented syntax for dplyr, but how about this for a workaround?
sampleGroup <- function(df, x = 1) {
  # sample up to x rows from each group, using the grouped_df "indices" attribute
  df[unlist(lapply(attr(df, "indices"),
                   function(r) sample(r, min(length(r), x)))), ]
}
sampleGroup(iris %.% group_by(Species), 3)
#Source: local data frame [9 x 5]
#Groups: Species
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#39 4.4 3.0 1.3 0.2 setosa
#16 5.7 4.4 1.5 0.4 setosa
#25 4.8 3.4 1.9 0.2 setosa
#51 7.0 3.2 4.7 1.4 versicolor
#62 5.9 3.0 4.2 1.5 versicolor
#59 6.6 2.9 4.6 1.3 versicolor
#148 6.5 3.0 5.2 2.0 virginica
#103 7.1 3.0 5.9 2.1 virginica
#120 6.0 2.2 5.0 1.5 virginica
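Note that this workaround relies on the internal "indices" attribute of early grouped_df objects, which later dplyr versions no longer expose. A hedged sketch of an equivalent on recent dplyr, using the exported group_rows() accessor and the %>% pipe (sampleGroup2 is just an illustrative name):
library(dplyr)
sampleGroup2 <- function(df, x = 1) {
  # group_rows() returns one vector of row indices per group
  idx <- unlist(lapply(group_rows(df), function(r) sample(r, min(length(r), x))))
  df[idx, ]
}
sampleGroup2(iris %>% group_by(Species), 3)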
EDIT - PERFORMANCE COMPARISON
Here's a test against data.table (both native and with a function call, as in the example above) for 1M rows and 26 groups.
Native data.table is about 2x as fast as the dplyr workaround, and also about 2x as fast as the data.table call with an external function, so the dplyr workaround and the data.table-with-function-call approach have roughly the same performance.
Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)
sampleGroup.dt <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}
testdata <- data.frame(group = sample(letters, 10e5, TRUE), runif(10e5))
dti <- data.table(testdata)
# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user system elapsed
#0.07 0.00 0.06
#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user system elapsed
#0.04 0.00 0.03
#using data.table with external function call
system.time(dti[, sampleGroup.dt(.SD, 10), by = group])
#user system elapsed
#0.06 0.02 0.08
dplyr 1.0.2 can subset rows with various slice verbs now (https://dplyr.tidyverse.org/reference/slice.html), including random sampling with slice_sample():
mtcars %>%
  slice_sample(n = 10)
and add a group_by() to sample within each category:
mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 2)
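slice_sample() also takes a prop argument if you want a proportion of each group instead of a fixed count, e.g. (the 10% here is arbitrary):
mtcars %>%
  group_by(cyl) %>%
  slice_sample(prop = 0.1)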