 

Creating data partition in R

With the caret package, a data partition of 75% training and 25% test is created with:

inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)

Note: the data set is named spam and the target variable is named type.

My question is, what is the purpose of including the y=spam$type argument?

Isn’t the purpose of creating data partitions simply to split the entire data set based on the proportion you require for training vs testing? Why does that argument need to be included in the code?

Aiden asked Jul 20 '16 20:07




1 Answer

I am assuming that the createDataPartition() in question is the one from the caret package.

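If the y argument (here spam$type) is a factor, which is generally the case for a classification target, the random sampling occurs within each class. The training and test sets therefore keep roughly the same class proportions as the full data set, which a plain random split of the rows does not guarantee.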
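To make "sampling within each class" concrete, here is a minimal base R sketch of the idea. It is not caret's actual implementation: the helper name stratified_indices is hypothetical, the toy labels are made up, and rounding up with ceiling() is simply a choice made for this sketch.

```r
# Hypothetical helper: a simplified illustration of per-class sampling,
# not the code that caret actually runs.
stratified_indices <- function(y, p) {
  # Group the row indices by class label
  idx_by_class <- split(seq_along(y), y)
  # Draw a fraction p of the indices from each class separately
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- ceiling(length(idx) * p)
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)  # arbitrary seed, only for reproducibility

# Toy target: 40 "spam" rows followed by 60 "nonspam" rows
y <- factor(rep(c("spam", "nonspam"), times = c(40, 60)))
train_idx <- stratified_indices(y, p = 0.75)

# Class counts among the selected rows: 75% of each class
table(y[train_idx])
```

Because the draw is made class by class, the 40/60 ratio of the full vector carries over to the selected rows regardless of the seed.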

As an example, suppose we partition the iris data set in the same proportion as in your question.

attach(iris)
summary(iris)

Notice the count listed for each species. Now run the following commands:

library(caret)
# Referring to the column as iris$Species works whether or not iris is attached
inTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)

inTrain now contains the row indices of approximately 75% of the rows of each species, which can be verified by issuing the following command:

summary(iris[inTrain,])

There are 50 rows for each species, and 38 of them (approximately 75%) have been randomly selected from each species for the training data set.
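To see what the y argument buys you, the stratified split can be compared with a plain random draw of row numbers. This is only a sketch, assuming caret is installed; the seed and the names in_train, training, testing and naive_rows are arbitrary choices for illustration.

```r
library(caret)

set.seed(42)  # arbitrary seed, only for reproducibility

# Stratified split: sampling happens within each level of Species
in_train <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# Plain random split of the same size, ignoring the class labels
naive_rows <- sample(nrow(iris), size = nrow(training))

table(training$Species)          # same count for every species
table(testing$Species)           # the remaining rows, also balanced
table(iris$Species[naive_rows])  # counts may differ between species
```

With a balanced data set such as iris the plain split is usually only slightly off, but with a rare class the imbalance can be much larger, which is the situation the within-class sampling guards against.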

Imran Ali answered Sep 28 '22 16:09