Reproducible example:
library(caTools) #for sample.split function
set.seed(123)
#Creating example data frame
example_df <- data.frame(personID = c(stringi::stri_rand_strings(1000, 5)),
                         sex = sample(1:2, 1000, replace=TRUE),
                         age = round(rnorm(1000, mean=50, sd=15), 0))
#Example of random splitting:
training_set <- example_df[sample.split(example_df$personID),]
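#Rows whose personID is not in the training set (negative indexing does not work with character IDs)
test_set <- example_df[!(example_df$personID %in% training_set$personID),]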
#Evaluation of variables in the test and training data sets:
#Should be close to 1 (in this case it's 1.2, which is too high)
(sum(training_set$sex == 1) / sum(training_set$sex == 2)) / (sum(test_set$sex == 1) / sum(test_set$sex == 2))
[1] 1.219139
#Should be close to 1 across the whole distribution (it's quite good, this is actually what I would expect)
summary(training_set$age) / summary(test_set$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.7143  0.9756  1.0000  1.0032  1.0169  1.0000
Although the sample.split function divided age appropriately (the distributions match), the proportions of males and females in the sex variable differ noticeably between the two sets. Which function can I use to split data automatically and evenly into multiple sets (two in this example) while preserving the proportions and distributions of the variables?
The caret package will build balanced sets for you with createDataPartition, which samples within the levels of the outcome you pass as y. Check the package vignette covering the basics. For example:
library(caret)
library(mlbench) #for the Sonar data set
data(Sonar)
inTrain <- createDataPartition(
  y = Sonar$Class, ## the outcome data are needed
  p = .75,         ## the proportion of data in the training set
  list = FALSE
)
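Applied to data shaped like the example in the question, a rough sketch could look like the following. Crossing sex with age quartiles to form a single stratum label is my own assumption about how to balance both variables at once, not something the vignette prescribes, and the sequential IDs are just a stand-in for the random strings.

```r
library(caret)  # createDataPartition()

set.seed(123)
# Data shaped like the question's example; sequential IDs are a stand-in
# (an assumption) to avoid the stringi dependency.
example_df <- data.frame(
  personID = sprintf("ID%04d", 1:1000),
  sex = sample(1:2, 1000, replace = TRUE),
  age = round(rnorm(1000, mean = 50, sd = 15), 0)
)

# Bin age into quartiles, then cross with sex to get one stratum label per row.
age_bin <- cut(example_df$age,
               breaks = unique(quantile(example_df$age, probs = seq(0, 1, 0.25))),
               include.lowest = TRUE)
strata <- interaction(example_df$sex, age_bin, drop = TRUE)

# For a factor y, createDataPartition samples within each level.
in_train <- createDataPartition(y = strata, p = 2/3, list = FALSE)
training_set <- example_df[in_train, ]
test_set <- example_df[-in_train, ]

# Same checks as in the question: both should be close to 1.
sex_ratio <- (sum(training_set$sex == 1) / sum(training_set$sex == 2)) /
  (sum(test_set$sex == 1) / sum(test_set$sex == 2))
print(sex_ratio)
print(summary(training_set$age) / summary(test_set$age))
```

Note that the stratum is passed as a factor. If you pass the integer sex column directly, it is treated as numeric and grouped by percentiles instead of by level, which is probably not what you want for a two-valued variable.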