I am trying to split my data frame into two parts at random. For example, I'd like to put a random 70% of the rows into one data frame and the remaining 30% into another. The original data frame has over 800,000 rows. I've tried a for loop that picks a random row number, binds that row to the first (70%) data frame with rbind(), and deletes it from the original data frame so that what is left becomes the other (30%) data frame. But this is extremely slow. Is there a relatively fast way to do this?
Use the split() function in R to split a vector or data frame into groups defined by a factor. Use unsplit() to reassemble the original vector or data frame from the pieces.
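As a rough sketch of how split() and unsplit() could be applied to a random 70/30 split: the data frame df, the grouping vector g, and the labels "a" and "b" below are made up for illustration and are not from the original post.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
df <- data.frame(x = runif(100), y = rnorm(100))  # hypothetical example data

# Assumption: a shuffled grouping vector with exactly 70 "a" and 30 "b" labels
g <- sample(rep(c("a", "b"), times = c(70, 30)))

parts <- split(df, g)           # list of two data frames: parts$a and parts$b
restored <- unsplit(parts, g)   # rows put back in their original order
```

Here parts$a holds the 70% part and parts$b the 30% part. unsplit() needs the same grouping vector that was used for the split.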
When doing an automated split, start by determining the sample size: count the rows and take the desired fraction of them (for example, 80%) as the selected sample. Next, use the sample() function to draw that many rows, which returns a vector of row indices.
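The steps described above might look like the following minimal sketch. The data frame df and the names n_sample, rows, part1 and part2 are placeholders chosen for this example; the 80% fraction follows the paragraph above and can be changed to 0.7 for a 70/30 split.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible
df <- data.frame(x = runif(1000), y = rnorm(1000))  # hypothetical example data

n_sample <- floor(0.8 * nrow(df))                   # sample size: 80% of the rows
rows <- sample(seq_len(nrow(df)), size = n_sample)  # row indices, drawn without replacement

part1 <- df[rows, ]   # the selected 80%
part2 <- df[-rows, ]  # the remaining 20%
```

Because the indexing is vectorized, this should avoid the row-by-row rbind() cost described in the question, though the actual timing on 800,000 rows will depend on the data.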
Try
n <- 100
data <- data.frame(x=runif(n), y=rnorm(n))
ind <- sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3))
data1 <- data[ind, ]
data2 <- data[!ind, ]
I am building on the answer by ExperimenteR, which appears robust. One issue, however, is that sample() with the prob argument draws each value independently with the given probabilities, so the resulting counts are random rather than fixed. Take this for example:
>sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3))
You would expect the numbers of TRUE and FALSE values to be exactly 70 and 30, respectively. Often this is not the case:
>table(sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3)))
FALSE TRUE
34 66
That is fine if you don't need to be precise. But if you would like exactly 70% and 30%, do this instead:
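n_true <- round(0.7 * nrow(data)) # number of rows for the 70% part (70 when n is 100)
v <- rep(c(TRUE, FALSE), times = c(n_true, nrow(data) - n_true)) # exactly 70% TRUE, 30% FALSE
ind <- sample(v) # shuffle them randomly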
data1 <- data[ind, ]
data2 <- data[!ind, ]