
Split vector randomly into two sets

I have a vector t of length 100 and want to divide it into two sub-vectors of 30 and 70 values. The values should be chosen randomly and without replacement, so none of the 30 values may appear in the sub-vector of 70 values, and vice versa.

I know the R function sample, which I can use to randomly choose values from a vector with or without replacement. However, even with replace = FALSE I have to run sample twice, once to choose 30 values and once to choose 70. That means some of the 30 values might also be among the 70 values, and vice versa.

Any ideas?

user969113 asked Sep 04 '12 10:09




2 Answers

Regarding my comment, what is wrong with:

vec <- 1:100
set.seed(2)
samp <- sample(length(vec), 30)

a <- vec[samp]
b <- vec[-samp]

?

To show these are separate sets with no duplicates:

R> intersect(a, b)
integer(0)

If your vector contains duplicate values, that is a different matter, but your question is unclear on this point.

With duplicates in vec, things are a bit more complicated, and it depends on what result you want to achieve.
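The index-based split above can be wrapped in a small helper. This is only a sketch: the name split_random and its arguments are hypothetical, not part of base R or of the original answer. It uses a logical mask instead of negative indexing, because x[-integer(0)] returns an empty vector when n is 0.

```r
# Hypothetical helper: split x into two disjoint parts by sampling
# positions (not values), so every element lands in exactly one part.
split_random <- function(x, n) {
  stopifnot(n >= 0, n <= length(x))
  idx <- sample.int(length(x), n)        # n distinct positions, no replacement
  in_first <- seq_along(x) %in% idx      # logical mask, safe even when n == 0
  list(first = x[idx], second = x[!in_first])
}

set.seed(2)
parts <- split_random(1:100, 30)
length(parts$first)                            # 30
length(parts$second)                           # 70
length(intersect(parts$first, parts$second))   # 0, since 1:100 has no repeats
```

Because the split works on positions, a single call to sample is enough; there is no need to draw the 70 values separately.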

R> set.seed(4)
R> vec <- sample(100, 100, replace = TRUE)
R> set.seed(6)
R> samp <- sample(100, 30)
R> a <- vec[samp]
R> b <- vec[-samp]
R> length(a)
[1] 30
R> length(b)
[1] 70
R> length(setdiff(vec, a))
[1] 41

So setdiff() "fails" here in the sense that it does not return 70 values, because it works on unique values rather than on positions. Note also that a and b share some values (but not observations from the sample!):

R> intersect(a, b)
 [1] 57 35 91 27 71 63  8 92 49 77

The duplicates (the intersection) arise because the values above occurred more than once in the original sample vec.

Gavin Simpson answered Sep 19 '22 16:09
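For the duplicate-value case, one way to confirm that the split is still a proper partition is to check positions rather than values. The following is a minimal sketch with arbitrary seeds; the variable rest is introduced here only for illustration.

```r
set.seed(4)
vec <- sample(100, 100, replace = TRUE)   # values may repeat
set.seed(6)
samp <- sample(length(vec), 30)           # 30 distinct positions
rest <- seq_along(vec)[-samp]             # the remaining 70 positions

a <- vec[samp]
b <- vec[rest]

# The two sets of positions do not overlap ...
length(intersect(samp, rest))             # 0
# ... and together the parts reproduce every observation exactly once.
identical(sort(c(a, b)), sort(vec))       # TRUE
```

So even when intersect(a, b) is non-empty, no single observation has been placed in both parts.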


How about this:

t <- 1:100 # or whatever your original set is
a <- sample(t, 70)
b <- setdiff(t, a)
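Note that setdiff() compares values and drops repeats, so the approach above relies on the vector having no duplicate values. The sketch below uses made-up data (x is hypothetical) to show the difference, and an alternative that shuffles the vector once and cuts it in two.

```r
x <- c(5, 5, 7, 7, 9, 9, 11, 11, 13, 13)   # hypothetical data with repeats

set.seed(1)
a <- sample(x, 7)
b <- setdiff(x, a)            # unique values of x that are not in a
length(a) + length(b) < length(x)   # TRUE: some observations are lost

# Alternative: permute once, then cut by position.
shuffled <- sample(x)         # random permutation of all elements
a2 <- shuffled[seq_len(7)]
b2 <- shuffled[-seq_len(7)]
length(a2) + length(b2)       # 10
```

With unique values, as in 1:100, both approaches give a valid 70/30 split.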

seancarmody answered Sep 19 '22 16:09