
How can I split a data frame in R randomly?

Tags: dataframe, r

I have a data frame with ca. 1000 rows, and I want to split it randomly into 8 smaller data frames, each containing 100 rows. I tried using the sample function 8 times on the data frame, but the separate calls sometimes select the same rows.

Lanza asked Apr 16 '16 10:04



1 Answer

We create a grouping variable by sampling the values 1 to 8, with size equal to the number of rows in the dataset. We then split the sequence of row indices by that grouping variable into a list, loop through the list with lapply, subset the dataset with each set of indices, and take the first 100 rows of each subset with head.

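# Assign each row index to one of 8 groups at random, then keep the first 100 rows per group
lst <- lapply(split(1:nrow(df1), sample(1:8, nrow(df1), replace=TRUE, prob = rep(1/8, 8))),
           function(i) head(df1[i,],100))
# Number of rows in each of the 8 data frames
sapply(lst, nrow)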
#  1   2   3   4   5   6   7   8 
#100 100 100 100 100 100 100 100 

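As @RHertel mentioned in the comments, we can instead call sample a second time to draw the 100 rows from each group at random, rather than taking the first 100:

# Same random grouping, but draw 100 row indices per group without replacement
lst <- lapply(split(1:nrow(df1), sample(1:8, nrow(df1), replace=TRUE, prob = rep(1/8, 8))),
       function(i) df1[sample(i, 100, replace=FALSE),])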

data

set.seed(24)
df1 <- data.frame(V1= 1:1000, V2= rnorm(1000))
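
One caveat with the random grouping used in this answer: each row is assigned to a group independently, so group sizes vary around 125, and a group can occasionally receive fewer than 100 rows. In that case head returns fewer than 100 rows, and sample(i, 100, replace=FALSE) stops with an error. A minimal alternative sketch, which is not part of the original answer and assumes the same df1 as above, is to draw 800 distinct row indices once and cut them into 8 consecutive blocks of 100. The names n_groups, group_size, idx and parts are illustrative only.

```r
set.seed(24)
df1 <- data.frame(V1 = 1:1000, V2 = rnorm(1000))

n_groups   <- 8    # number of smaller data frames wanted
group_size <- 100  # rows per smaller data frame

# Draw 800 distinct row indices in random order (sampling without replacement)
idx <- sample(nrow(df1), n_groups * group_size)

# Label the shuffled indices 1,...,1, 2,...,2, ..., 8 and split the selected rows by label
parts <- split(df1[idx, ], rep(seq_len(n_groups), each = group_size))

# Check the sizes, and that no row appears in more than one piece (0 means no duplicates)
sapply(parts, nrow)
anyDuplicated(unlist(lapply(parts, function(p) p$V1)))
```

Because all indices come from a single call to sample without replacement, the pieces cannot overlap, and the 200 rows that are not drawn are simply left out.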
akrun answered Oct 18 '22 03:10