I'm trying to partition a data set that I have in R, 2/3 for training and 1/3 for testing. I have one classification variable, and seven numerical variables. Each observation is classified as either A, B, C, or D.
For simplicity's sake, let's say that the classification variable, cl, is A for the first 100 observations, B for observations 101 to 200, C till 300, and D till 400. I'm trying to get a partition that has 2/3 of the observations for each of A, B, C, and D (as opposed to simply getting 2/3 of the observations for the entire data set since it will likely not have equal amounts of each classification).
When I try to sample from a subset of the data, such as sample(subset(data, cl == 'A')), the columns are reordered instead of the rows.
To summarize, my goal is to have 67 random observations from each of A, B, C, and D as my training data, and store the remaining 33 observations for each of A, B, C, and D as testing data. I have found a very similar question to mine, but it did not factor in multiple variables.
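On the column reordering mentioned in the question: a data frame is a list of columns, so sample() applied to it permutes the columns. A minimal sketch of the difference, assuming a small made-up data frame in place of the asker's `data` (the column names `num1` and `num2` are hypothetical):

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# hypothetical toy data standing in for the asker's `data`
data <- data.frame(
  cl   = rep(c("A", "B"), each = 5),
  num1 = rnorm(10),
  num2 = rnorm(10)
)

sub <- subset(data, cl == "A")

# sample() treats the data frame as a list of columns,
# so this returns the same columns in a random order
shuffled.cols <- sample(sub)

# to shuffle rows instead, sample the row positions
# and use them as a row index
shuffled.rows <- sub[sample(nrow(sub)), ]
```

Sampling row positions and then indexing with them is the same idea the answers below rely on.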
There is actually a nice package, caret, for dealing with machine learning problems. It contains a function, createDataPartition(), that pretty much does this: it samples a proportion p (here 2/3) from each level of a supplied factor:
#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]
This may be longer, but I think it's more intuitive, and it can be done in base R ;)
# create the data frame you've described
x <-
    data.frame(
        cl =
            c(
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 )
            ) ,
        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 )
    )
# sample 67 training rows within classification groups
training.rows <-
    tapply(
        # numeric vector containing the numbers
        # 1 to nrow( x )
        1:nrow( x ) ,
        # break the sample function out by
        # the classification variable
        x$cl ,
        # use the sample function within
        # each classification variable group
        sample ,
        # send the size = 67 parameter
        # through to the sample() function
        size = 67
    )
# convert your list back to a numeric vector
tr <- unlist( training.rows )
# split your original data frame into two:
# all the records sampled as training rows
training.df <- x[ tr , ]
# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
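The question notes that the real data will likely not have equal numbers in each class, so a fixed size = 67 would not give 2/3 per class there. A hedged base R sketch of a proportional variant, assuming made-up class sizes (90, 120, 60, 30) and the hypothetical name `unbalanced` for the data frame:

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# hypothetical data with unequal class sizes
unbalanced <- data.frame(
  cl        = rep(c("A", "B", "C", "D"), times = c(90, 120, 60, 30)),
  othernum1 = rnorm(300)
)

# group the row positions by class
rows.by.class <- split(seq_len(nrow(unbalanced)), unbalanced$cl)

# within each class, draw about 2/3 of its row positions
training.rows <- lapply(rows.by.class, function(idx) {
  n.train <- round(length(idx) * 2 / 3)
  # index into idx explicitly: sample(idx, n) would sample from 1:idx
  # if a class happened to contain a single row
  idx[sample.int(length(idx), size = n.train)]
})

# flatten the per-class list into one vector of row positions
tr <- unlist(training.rows, use.names = FALSE)

training.df <- unbalanced[tr, ]
testing.df  <- unbalanced[-tr, ]

# per-class counts in each part, to confirm the split
table(training.df$cl)
table(testing.df$cl)
```

Rounding is a design choice here; floor() or ceiling() would shift a row between training and testing for classes whose size is not divisible by 3.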