How to split data into training/testing sets using sample function

There are numerous approaches to data partitioning. For a more complete approach, take a look at the createDataPartition function in the caret package.

Here is a simple example:

data(mtcars)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]
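
If the same split is needed in several places, the steps above can be wrapped in a small function. This is only a sketch: the name split_train_test and its arguments are made up for illustration and are not part of base R or any package. It uses setdiff() for the test rows so that an empty training index does not accidentally produce an empty test set.

```r
# Hypothetical helper (illustrative name, not from any package):
# returns a list holding the `train` and `test` data frames.
split_train_test <- function(df, train_frac = 0.75, seed = NULL) {
  stopifnot(is.data.frame(df), train_frac > 0, train_frac < 1)
  if (!is.null(seed)) set.seed(seed)  # make the partition reproducible

  n_train   <- floor(train_frac * nrow(df))
  train_ind <- sample(seq_len(nrow(df)), size = n_train)
  # setdiff() stays correct even when train_ind is empty,
  # whereas df[-integer(0), ] would return zero rows
  test_ind  <- setdiff(seq_len(nrow(df)), train_ind)

  list(train = df[train_ind, , drop = FALSE],
       test  = df[test_ind, , drop = FALSE])
}

parts <- split_train_test(mtcars, train_frac = 0.75, seed = 123)
nrow(parts$train)  # 24 of the 32 rows
nrow(parts$test)   # the remaining 8 rows
# No row should appear in both sets
length(intersect(rownames(parts$train), rownames(parts$test)))  # 0
```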

This can also be done with sample.int:

set.seed(101) # set the seed so the same sample can be reproduced later
# `data` stands for your own data frame; select 75% of its rows as the training sample
train_ind <- sample.int(n = nrow(data), size = floor(0.75 * nrow(data)), replace = FALSE)
train <- data[train_ind, ]
test  <- data[-train_ind, ]

Using the caTools package:

library(caTools)
set.seed(101)
# `anycolumn` is a placeholder: pass the column whose label ratios should be preserved
split <- sample.split(data$anycolumn, SplitRatio = 0.75)
train <- subset(data, split == TRUE)
test  <- subset(data, split == FALSE)
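
sample.split is documented as preserving the relative ratios of the labels in the vector it is given. For readers without caTools, the same idea can be sketched in base R by sampling within each class. The function name stratified_split and its arguments are hypothetical, and the rounding within each class may differ from what caTools does.

```r
# Hypothetical helper (illustrative name): stratified split on a label
# column using only base R.
stratified_split <- function(df, label, train_frac = 0.75, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)

  # Row indices grouped by class label
  idx_by_class <- split(seq_len(nrow(df)), df[[label]])

  # Draw train_frac of the rows inside each class; indexing through
  # sample.int() avoids the sample(x) surprise for length-one vectors
  train_ind <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(train_frac * length(idx)))]
  }), use.names = FALSE)

  test_ind <- setdiff(seq_len(nrow(df)), train_ind)
  list(train = df[train_ind, , drop = FALSE],
       test  = df[test_ind, , drop = FALSE])
}

parts <- stratified_split(iris, "Species", train_frac = 0.75, seed = 101)
table(parts$train$Species)  # 37 rows per species
table(parts$test$Species)   # 13 rows per species
```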

I would use dplyr for this; it makes the split very simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating the sets but also for traceability during your project. Add one if your data does not already have it.

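library(dplyr)  # provides %>%, sample_frac() and anti_join()

set.seed(123)   # make the random sample reproducible
mtcars$id <- seq_len(nrow(mtcars))
train <- mtcars %>% dplyr::sample_frac(.75)
test  <- dplyr::anti_join(mtcars, train, by = 'id')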
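
In dplyr 1.0.0 and later, sample_frac() is marked as superseded in favour of slice_sample(). A minimal sketch of the same split with the newer verb, assuming dplyr >= 1.0.0 is installed (the cars data frame and its id column are just illustrative names):

```r
library(dplyr)

set.seed(123)  # make the random split reproducible

# Work on a copy with an explicit id column so rows can be matched later
cars <- mtcars %>% mutate(id = row_number())

# Draw 75% of the rows for training
train <- cars %>% slice_sample(prop = 0.75)

# Everything whose id is not in the training set becomes the test set
test <- anti_join(cars, train, by = "id")

nrow(train)  # 24
nrow(test)   # 8
```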

This is almost the same code, but laid out more neatly:

bound <- floor((nrow(df)/4)*3)         # index of the last training row (75% of the rows)

df <- df[sample(nrow(df)), ]           # shuffle the rows
df.train <- df[1:bound, ]              # first 75% become the training set
df.test <- df[(bound+1):nrow(df), ]    # remaining rows become the test set
