When I want to randomly select some samples from different groups, I use the plyr package and the code below:
require(plyr)
sampleGroup <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}
iris.sample <- ddply(iris, .(Species), function(df) sampleGroup(df, 10))
Here 10 samples are selected from each species.
Some of my data frames are very big, so my question is: can I use the same sampleGroup function with the dplyr package, or is there another way to do the same thing in dplyr?
EDIT
Version 0.2 of the dplyr package introduced two new functions for selecting random rows from a table: sample_n() and sample_frac().
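For reference, here is a minimal sketch of those two functions applied to the same grouped-sampling task (the 10-rows-per-Species and 10% figures are just illustrative):
library(dplyr)
# a fixed number of rows per group
iris %>% group_by(Species) %>% sample_n(10)
# or a fraction of each group's rows
iris %>% group_by(Species) %>% sample_frac(0.1)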
Yes, you can use dplyr:
mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 2)
and the result looks like this:
Source: local data frame [6 x 11]
Groups: cyl
mpg cyl disp hp drat wt qsec vs am gear carb
1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
3 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
4 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
5 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
6 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Historical note: slice_sample() replaced sample_n() in dplyr 1.0.0 (May 2020). Early versions of dplyr required do(sample_n(., 2)) instead.
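For completeness, a sketch of that older do() pattern on the same example (my reconstruction, not part of the original answer):
mtcars %>%
  group_by(cyl) %>%
  do(sample_n(., 2))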
This is easy to do with data.table, and it is useful for a big table.
NOTE: As mentioned in the comments by Troy, there is a more efficient way of doing this with data.table, but I wanted to respect the OP's sample function and format in this answer (see the sketch after the iris example below).
require(data.table)
DT <- data.table(x = rnorm(10e6, 100, 50), y = letters)
sampleGroup <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}
result <- DT[, sampleGroup(.SD, 10), by = y]
print(result)
# y x y
# 1: a 30.11659 m
# 2: a 57.99974 h
# 3: a 58.13634 o
# 4: a 87.28466 x
# 5: a 85.54986 j
# ---
# 256: z 149.85817 d
# 257: z 160.24293 e
# 258: z 26.63071 j
# 259: z 17.00083 t
# 260: z 130.27796 f
system.time(DT[, sampleGroup(.SD, 10), by=y])
# user system elapsed
# 0.66 0.02 0.69
Using the iris dataset:
iris <- data.table(iris)
iris[, sampleGroup(.SD, 10), by = Species]
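And, for reference, a sketch of the more efficient native data.table idiom mentioned above: sample row indices per group with .I, then subset the table once, which avoids the per-group function call (it mirrors the benchmark code further down):
# Species and the sampled row indices come back as one data.table; $val extracts the indices
iris[iris[, list(val = sample(.I, 10)), by = Species]$val]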
That's a good question! I can't see any easy way to do it with the documented syntax for dplyr, but how about this for a workaround?
sampleGroup <- function(df, x = 1) {
  # sample up to x rows from each group, using the grouped_df "indices" attribute
  df[unlist(lapply(attr(df, "indices"),
                   function(r) sample(r, min(length(r), x)))), ]
}
sampleGroup(iris %.% group_by(Species), 3)
#Source: local data frame [9 x 5]
#Groups: Species
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#39 4.4 3.0 1.3 0.2 setosa
#16 5.7 4.4 1.5 0.4 setosa
#25 4.8 3.4 1.9 0.2 setosa
#51 7.0 3.2 4.7 1.4 versicolor
#62 5.9 3.0 4.2 1.5 versicolor
#59 6.6 2.9 4.6 1.3 versicolor
#148 6.5 3.0 5.2 2.0 virginica
#103 7.1 3.0 5.9 2.1 virginica
#120 6.0 2.2 5.0 1.5 virginica
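Note that this workaround relies on the internal "indices" attribute of early grouped_df objects, which later dplyr versions no longer expose. A hedged sketch of an equivalent on recent dplyr, using the exported group_rows() accessor and the %>% pipe (sampleGroup2 is just an illustrative name):
library(dplyr)
sampleGroup2 <- function(df, x = 1) {
  # group_rows() returns one vector of row indices per group
  idx <- unlist(lapply(group_rows(df), function(r) sample(r, min(length(r), x))))
  df[idx, ]
}
sampleGroup2(iris %>% group_by(Species), 3)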
EDIT - PERFORMANCE COMPARISON
Here's a test against data.table (both native and with a function call, as in the example above) for 1M rows and 26 groups.
Native data.table is about 2x as fast as the dplyr workaround, and also about 2x as fast as the data.table call with an external function, so the dplyr workaround and the data.table-with-function-call approach have roughly the same performance.
Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)
sampleGroup.dt <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}
testdata <- data.frame(group = sample(letters, 10e5, TRUE), runif(10e5))
dti <- data.table(testdata)
# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user system elapsed
#0.07 0.00 0.06
#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user system elapsed
#0.04 0.00 0.03
#using data.table with external function call
system.time(dti[, sampleGroup.dt(.SD, 10), by = group])
#user system elapsed
#0.06 0.02 0.08
dplyr 1.0.2 can subset rows with various slice verbs now (https://dplyr.tidyverse.org/reference/slice.html), including random sampling with slice_sample():
mtcars %>%
  slice_sample(n = 10)
and add a group_by() to sample within each category:
mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 2)
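slice_sample() also takes a prop argument if you want a proportion of each group instead of a fixed count, e.g. (the 10% here is arbitrary):
mtcars %>%
  group_by(cyl) %>%
  slice_sample(prop = 0.1)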