
R split data into 2 parts randomly

Tags: random, split, r

I am trying to split my data frame into 2 parts randomly. For example, I'd like to get a random 70% of the rows into one data frame and the other 30% into another data frame. The original data frame has over 800000 rows. I've tried a for loop that picks a random row number, binds that row to the first (70%) data frame using rbind(), and deletes it from the original data frame, so that what remains is the other (30%) data frame. But this is extremely slow. Is there a relatively fast way I could do this?

gregorp asked Jul 01 '15 05:07





2 Answers

Try

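n <- 100
data <- data.frame(x=runif(n), y=rnorm(n)) # example data
ind <- sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3)) # TRUE with probability 0.7 for each row
data1 <- data[ind, ] # rows where ind is TRUE (about 70%)
data2 <- data[!ind, ] # the remaining rows (about 30%)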
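As a rough check that this scales to the size mentioned in the question, here is a minimal sketch that applies the same logical-mask idea to 800000 rows. The seed, the stand-in data, and the names df, part1, and part2 are assumptions made for illustration, not part of the original answer.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

n <- 800000                                   # row count mentioned in the question
df <- data.frame(x = runif(n), y = rnorm(n))  # stand-in data, not the asker's

# One vectorized draw: each row is independently TRUE with probability 0.7
ind <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))

part1 <- df[ind, ]   # roughly 70% of the rows
part2 <- df[!ind, ]  # the remaining rows

# Every row lands in exactly one of the two parts
stopifnot(nrow(part1) + nrow(part2) == n)

# The observed share is close to 0.7, but usually not exactly 0.7
print(nrow(part1) / n)
```

There is no loop and no repeated rbind() here: the mask is drawn once and each part is extracted with a single subsetting call, which is why this should be far quicker than moving rows one at a time.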
ExperimenteR answered Sep 21 '22 11:09



I am building on the answer by ExperimenteR, which appears robust. One issue, however, is that sample() with prob draws each value independently, so the 70/30 ratio only holds on average and the actual counts vary from run to run. Take this for example:

>sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3))

You might expect the number of TRUE and FALSE values to be exactly 70 and 30, respectively. Often this is not the case:

>table(sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3)))

 FALSE  TRUE 
    34    66 
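To see how much the count can drift, here is a small simulation sketch. The seed and the number of repetitions are arbitrary assumptions. Since each value is drawn independently, the number of TRUE values should follow a binomial distribution with size 100 and probability 0.7.

```r
set.seed(1)  # assumed seed for reproducibility

n <- 100
# Repeat the draw 10000 times and count the TRUE values each time
counts <- replicate(
  10000,
  sum(sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3)))
)

mean(counts)         # close to n * 0.7 = 70
sd(counts)           # close to sqrt(n * 0.7 * 0.3), about 4.58
mean(counts == 70)   # share of draws that hit exactly 70
dbinom(70, n, 0.7)   # theoretical probability of exactly 70
```

A count such as 66 is therefore less than one standard deviation from the expected 70.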

That is fine if you do not need exact proportions. But if you would like exactly 70% and 30%, do this instead:

n <- nrow(data)
n1 <- round(0.7 * n) # number of rows for the 70% part (70 when n is 100)
v <- rep(c(TRUE, FALSE), times = c(n1, n - n1)) # n1 TRUE, the rest FALSE
ind <- sample(v) # shuffle them randomly
data1 <- data[ind, ]
data2 <- data[!ind, ]
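The same exact split can also be done by sampling row numbers without replacement. The sketch below wraps that in a helper. Note that split_frame, its frac argument, and the stand-in data are hypothetical names introduced here for illustration, not something from either answer.

```r
# Hypothetical helper (not from the answers above): exact split by row index
split_frame <- function(df, frac = 0.7) {
  n <- nrow(df)
  n1 <- round(frac * n)            # size of the first part
  idx <- sample.int(n, size = n1)  # n1 distinct row numbers, drawn without replacement
  in_first <- logical(n)           # all FALSE to start
  in_first[idx] <- TRUE            # mark the sampled rows
  list(first  = df[in_first, , drop = FALSE],
       second = df[!in_first, , drop = FALSE])
}

set.seed(123)  # assumed seed for reproducibility
df <- data.frame(x = runif(800000), y = rnorm(800000))  # stand-in data
parts <- split_frame(df, 0.7)

nrow(parts$first)   # 560000
nrow(parts$second)  # 240000
```

A logical mask is used instead of df[-idx, ] because negative indexing with an empty idx would return zero rows rather than all of them.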
Workhorse answered Sep 22 '22 11:09
