
Randomly sample data frame into 3 groups in R

Objective: Randomly divide a data frame into 3 samples.

  • one sample with 60% of the rows
  • two other samples with 20% of the rows each
  • samples should not have duplicates of others (i.e. sample without replacement).

Here's a clunky solution:

allrows <- seq_len(nrow(mtcars))

set.seed(7)
# 60% of the rows for training (size rounded down to a whole number)
trainrows <- sample(allrows, replace = FALSE, size = floor(0.6 * length(allrows)))
# the remaining 40%, to be split evenly between test and cross-validation
test_cvrows <- allrows[-trainrows]
testrows <- sample(test_cvrows, replace = FALSE, size = floor(0.5 * length(test_cvrows)))
cvrows <- setdiff(test_cvrows, testrows)

train <- mtcars[trainrows,]
test <- mtcars[testrows,]
cvr <- mtcars[cvrows,]

There must be something easier, perhaps in a package. dplyr has the sample_frac function, but that seems to target a single sample, not a split into multiple.

Close, but not quite the answer to this question: Random Sample with multiple probabilities in R

Minnow asked Dec 01 '15 19:12



2 Answers

Do you need the partitioning to be exact? If not,

set.seed(7)
ss <- sample(1:3,size=nrow(mtcars),replace=TRUE,prob=c(0.6,0.2,0.2))
train <- mtcars[ss==1,]
test <- mtcars[ss==2,]
cvr <- mtcars[ss==3,]

should do it.

Or, as @Frank says in comments, you can split() the original data to keep them as elements of a list:

mycars <- setNames(split(mtcars,ss), c("train","test","cvr"))
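As a quick check on the split() approach, the sketch below counts the rows in each list element. Because every row is assigned independently, the counts only approximate 60/20/20, and on a small data frame a group could in principle come out empty, in which case the three names would no longer match the list. Converting ss to a factor with all three levels is one way to guard against that; the groups and sizes variables are illustrative names, not part of the answer above.

```r
set.seed(7)
ss <- sample(1:3, size = nrow(mtcars), replace = TRUE, prob = c(0.6, 0.2, 0.2))

# A factor with explicit levels keeps all three groups, even if one is empty
groups <- factor(ss, levels = 1:3, labels = c("train", "test", "cvr"))
mycars <- split(mtcars, groups)

# Rows per group; the proportions are approximate, not exact
sizes <- sapply(mycars, nrow)
print(sizes)
print(round(sizes / nrow(mtcars), 2))
```

Individual pieces are then available as mycars$train, mycars$test and mycars$cvr.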
Ben Bolker answered Oct 21 '22 23:10


Not the prettiest solution (especially for larger samples), but it works.

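n = nrow(mtcars)
# group sizes are rounded down here; use different rounding for different sizes/proportions
times = rep(1:3, floor(c(0.6, 0.2, 0.2) * n))
ntimes = length(times)
# rounding down can leave a few rows unassigned; give each a randomly chosen group
if (ntimes < n)
    times = c(times, sample(1:3, n - ntimes, prob = c(0.6, 0.2, 0.2), replace = FALSE))
# shuffle the group labels so that membership is random
sets = sample(times)
df1 = mtcars[sets==1,]
df2 = mtcars[sets==2,]
df3 = mtcars[sets==3,]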
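A quick sanity check of this approach: the three pieces should share no rows and together cover the whole data frame, with sizes within one row of 60/20/20. The sketch below repeats the assignment and tests those properties. The seed and the use of row names as row identifiers are assumptions made for illustration (mtcars has unique row names).

```r
set.seed(7)  # arbitrary seed, only for reproducibility
n <- nrow(mtcars)
times <- rep(1:3, floor(c(0.6, 0.2, 0.2) * n))
if (length(times) < n) {
  # top up the rows left over after rounding down
  times <- c(times, sample(1:3, n - length(times), prob = c(0.6, 0.2, 0.2)))
}
sets <- sample(times)
parts <- split(mtcars, sets)

# Rows per piece
print(sapply(parts, nrow))

# No row appears in more than one piece, and every row appears somewhere
ids <- unlist(lapply(parts, rownames))
stopifnot(!anyDuplicated(ids), setequal(ids, rownames(mtcars)))
```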
Max Candocia answered Oct 21 '22 23:10