
R: How to split data into training and testing set, while preserving proportions & distributions of variables?

Tags: r, testing

Reproducible example:

library(caTools) #for sample.split function
set.seed(123)
#Creating example data frame
example_df <- data.frame(personID = c(stringi::stri_rand_strings(1000, 5)),
                           sex = sample(1:2, 1000, replace=TRUE),
                           age = round(rnorm(1000, mean=50, sd=15), 0))

#Example of random splitting:
in_train <- sample.split(example_df$personID) #logical mask, TRUE = training row
training_set <- example_df[in_train,]
test_set <- example_df[!in_train,]

#evaluation of variables in test and training data sets:
  #Has to approximate 1 (in this case it's 1.2, which is too high)
  (sum(training_set$sex == 1) / sum(training_set$sex == 2)) / (sum(test_set$sex == 1) / sum(test_set$sex == 2)) 
  [1] 1.219139
  #Has to approximate 1 along the distribution (it's quite good, this is actually what I would expect)
  summary(training_set$age) / summary(test_set$age)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.7143  0.9756  1.0000  1.0032  1.0169  1.0000 

Although sample.split divided age appropriately (the distributions match), the proportions of males and females differ noticeably between the two sets. Which function can automatically split data into multiple sets (two in this example) while preserving the proportions and distributions of the variables?

juststuck asked Dec 29 '25 11:12


1 Answer

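The caret package will build balanced sets for you: createDataPartition samples within the levels of the outcome you pass as y, so the class proportions are preserved in both sets. Check the package vignette covering the basics. For example: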

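library(caret)    # for createDataPartition
library(mlbench)  # for the Sonar data set
data(Sonar)

inTrain <- createDataPartition(
  y = Sonar$Class,  # the outcome data are needed
  p = .75,          # the proportion of data in the training set
  list = FALSE
)
training <- Sonar[inTrain, ]  # rows selected for training
testing <- Sonar[-inTrain, ]  # all remaining rows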
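
createDataPartition stratifies only on the single vector passed as y, so balancing sex and age at the same time needs one combined stratum. Here is a minimal base R sketch of that idea, applied to data shaped like the question's example. The quartile binning of age, the 75% ratio, the integer IDs, and the names age_bin, strata, and train_idx are all illustrative assumptions, not something taken from the question or from caret:

```r
set.seed(123)

# Hypothetical stand-in for the question's data (IDs simplified to integers).
example_df <- data.frame(
  personID = seq_len(1000),
  sex = sample(1:2, 1000, replace = TRUE),
  age = round(rnorm(1000, mean = 50, sd = 15), 0)
)

# Bin age into quartiles so a numeric variable can take part in stratification.
age_bin <- cut(
  example_df$age,
  breaks = quantile(example_df$age, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE
)

# One stratum per sex x age-quartile combination.
strata <- interaction(example_df$sex, age_bin, drop = TRUE)

# Draw roughly 75% of the row indices within each stratum.
train_idx <- unlist(lapply(
  split(seq_len(nrow(example_df)), strata),
  function(rows) rows[sample.int(length(rows), round(0.75 * length(rows)))]
))

training_set <- example_df[train_idx, ]
test_set <- example_df[-train_idx, ]

# Compare the sex proportions in the two sets; they should be nearly equal.
prop.table(table(training_set$sex))
prop.table(table(test_set$sex))
```

The same strata factor could also be passed as y to createDataPartition in place of the manual sampling step. Finer age bins track the age distribution more closely but leave fewer rows per stratum.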
itsMeInMiami answered Jan 01 '26 06:01


