
sample rows of subgroups from dataframe with dplyr

Tags:

r

dplyr

sample

To randomly select a fixed number of rows from each group, I use the plyr package and the code below:

require(plyr)
sampleGroup<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

iris.sample<-ddply(iris,.(Species),function(df) sampleGroup(df,10))

Here 10 samples are selected from each species.

Some of my data frames are very big. Can I use the same sampleGroup function with the dplyr package, or is there another way to do the same thing in dplyr?

EDIT

Version 0.2 of the dplyr package introduced two new functions to select random rows from a table: sample_n and sample_frac.
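As a minimal sketch of how those two functions apply to the iris example, assuming a dplyr version that provides the %>% pipe (the seed value is arbitrary and only makes the draw repeatable). Both functions were superseded by slice_sample() in dplyr 1.0.0 but are still available:

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the draw is repeatable

# 10 random rows from each species
iris_n <- iris %>%
  group_by(Species) %>%
  sample_n(10)

# 10% of the rows of each species
iris_frac <- iris %>%
  group_by(Species) %>%
  sample_frac(0.1)
```

Because the data frame is grouped first, the size or fraction is applied within each group rather than to the table as a whole.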

Robert asked Jan 21 '14 10:01


4 Answers

Yes, you can use dplyr:

mtcars %>% 
    group_by(cyl) %>%
    slice_sample(n = 2)

The result looks like this:

Source: local data frame [6 x 11]
Groups: cyl

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
4 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
5 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
6 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Historical note: slice_sample() superseded sample_n() in dplyr 1.0.0 (May 2020). Early versions of dplyr required do(sample_n(., 2)).
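One detail worth knowing when groups differ in size: with the default replace = FALSE, slice_sample() silently truncates the request to the group size instead of raising an error. A small sketch, assuming dplyr 1.0.0 or later (the seed is an arbitrary choice):

```r
library(dplyr)

set.seed(1)  # arbitrary seed; any fixed value makes the draw repeatable

# Ask for 10 rows per group; mtcars has only 7 six-cylinder cars,
# so that group contributes all 7 of its rows
sampled <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 10)

# Rows actually returned per group
count(sampled, cyl)
```

If every group must contribute exactly the same number of rows, it may be worth checking the group sizes first.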

PhilChang answered Nov 10 '22 00:11


This is easy to do with data.table, and useful for a big table.

NOTE: As mentioned in the comments by Troy, there is a more efficient way of doing this with data.table, but I wanted to respect the OP's sample function and format in the answer.

require(data.table)
DT <- data.table(x = rnorm(10e6, 100, 50), y = letters)

sampleGroup<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

result <- DT[, sampleGroup(.SD, 10), by=y]
print(result)

# y         x y
# 1: a  30.11659 m
# 2: a  57.99974 h
# 3: a  58.13634 o
# 4: a  87.28466 x
# 5: a  85.54986 j
# ---              
# 256: z 149.85817 d
# 257: z 160.24293 e
# 258: z  26.63071 j
# 259: z  17.00083 t
# 260: z 130.27796 f

system.time(DT[, sampleGroup(.SD, 10), by=y])
# user  system elapsed 
# 0.66    0.02    0.69 

Using the iris dataset:
iris <- data.table(iris)
iris[,sampleGroup(.SD, 10), by=Species]
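
The more efficient route mentioned in the note samples row numbers instead of subsetting .SD for every group. A sketch of that idea on iris, where iris_dt, idx and the min(.N, 10L) guard are illustrative choices rather than part of the original answer:

```r
library(data.table)

iris_dt <- as.data.table(iris)

set.seed(1)  # arbitrary seed for a repeatable draw

# .I holds each group's row numbers in the full table; .N is the group size.
# min(.N, 10L) keeps the call valid for groups with fewer than 10 rows.
idx <- iris_dt[, .(row = .I[sample(.N, min(.N, 10L))]), by = Species]$row

# Subset the full table once, using the collected row numbers
iris_sample <- iris_dt[idx]
```

Only integer indices are built per group, and the table itself is subset a single time at the end.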
marbel answered Nov 09 '22 23:11


That's a good question! I can't see an easy way to do it with the documented dplyr syntax, but how about this as a workaround?

sampleGroup<-function(df,x=1){

  df[
    unlist(lapply(attr((df),"indices"),function(r)sample(r,min(length(r),x))))
    ,]

}

sampleGroup(iris %.% group_by(Species),3)

#Source: local data frame [9 x 5]
#Groups: Species
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#39           4.4         3.0          1.3         0.2     setosa
#16           5.7         4.4          1.5         0.4     setosa
#25           4.8         3.4          1.9         0.2     setosa
#51           7.0         3.2          4.7         1.4 versicolor
#62           5.9         3.0          4.2         1.5 versicolor
#59           6.6         2.9          4.6         1.3 versicolor
#148          6.5         3.0          5.2         2.0  virginica
#103          7.1         3.0          5.9         2.1  virginica
#120          6.0         2.2          5.0         1.5  virginica
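
The workaround above reads the grouped data frame's "indices" attribute, an internal detail of early dplyr, and uses the old %.% operator. A rough equivalent for later versions, assuming dplyr 0.8.0 or newer where group_rows() is exported, might look like this (sample_group_rows is a hypothetical name, not a dplyr function):

```r
library(dplyr)

# Hypothetical helper: sample up to x rows from each group of a grouped data frame
sample_group_rows <- function(grouped_df, x = 1) {
  rows <- unlist(lapply(
    group_rows(grouped_df),  # list of row-number vectors, one per group
    function(r) r[sample.int(length(r), min(length(r), x))]
  ))
  # Drop the grouping before subsetting by the sampled row numbers
  ungroup(grouped_df)[rows, ]
}

set.seed(1)  # arbitrary seed for a repeatable draw
res <- sample_group_rows(group_by(iris, Species), 3)
```

Indexing with sample.int(length(r), ...) rather than sample(r, ...) avoids the base R pitfall where a length-one numeric vector is treated as the range 1:r.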

EDIT - PERFORMANCE COMPARISON

Here's a test against data.table (both native and with an external function call, as in the example) for 1 million rows and 26 groups.

Native data.table is about 2x as fast as both the dplyr workaround and the data.table call with an external function. At this scale the differences are a few hundredths of a second, so dplyr and data.table perform about the same.

Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)

require(dplyr)
require(data.table)

sampleGroup.dt<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

# 1 million rows, 26 groups
testdata<-data.frame(group=sample(letters,10e5,TRUE),value=runif(10e5))

dti<-data.table(testdata)

# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user  system elapsed 
#0.07    0.00    0.06 

#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user  system elapsed 
#0.04    0.00    0.03 

#using data.table with external function call
#(.SD is the current group's subset, so sampling happens within each group)
system.time(dti[, sampleGroup.dt(.SD, 10), by=group])
#user  system elapsed 
#0.06    0.02    0.08 
Troy answered Nov 10 '22 00:11


As of dplyr 1.0.2, several slice verbs can subset rows (https://dplyr.tidyverse.org/reference/slice.html), including slice_sample for random rows:

mtcars %>% 
  slice_sample(n = 10)

Add a group_by() to sample within each category:

mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n = 2)
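
slice_sample() also accepts a proportion through prop, and sampling with replacement through replace. A brief sketch under the same assumption of dplyr 1.0.0 or later, with illustrative values for the proportion, size and seed:

```r
library(dplyr)

set.seed(1)  # arbitrary seed for a repeatable draw

# A quarter of the rows within each cylinder group
quarter <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(prop = 0.25)

# 20 rows per group, drawn with replacement, so groups with
# fewer than 20 rows still return 20
boot <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 20, replace = TRUE)
```

With replace = TRUE the same row can appear more than once, which is the usual setup for a per-group bootstrap.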
Zoë Turner answered Nov 09 '22 22:11