I have a vector t of length 100 and want to divide it into two sub-vectors of 30 and 70 values, chosen randomly and without replacement. None of the 30 values may appear in the sub-vector of 70 values, and vice versa.
I know the R function sample(), which I can use to randomly choose values from a vector with or without replacement. However, even with replace = FALSE, I have to call sample() twice, once to choose 30 values and once to choose 70. That means some of the 30 values might also be among the 70 values, and vice versa.
Any ideas?
split() is a built-in R function that divides a vector or data frame into groups defined by a grouping factor. It takes the vector or data frame x and a factor f, and returns a list with one element per group. Its syntax is: split(x, f, drop = FALSE, ...)
To split a vector into sequential chunks, build the grouping vector with ceiling(): ceiling(seq_along(x) / chunk_length) assigns each element a chunk number, where seq_along(x) is the sequence of positions in x and chunk_length is the length of each chunk. That grouping vector is then passed to split() as f.
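As a minimal sketch of the chunking idiom described above, assuming a chunk length of 30 (the names x, chunk_length, groups and chunks are illustrative and not from the original). Note that this splits sequentially, not randomly:

```r
x <- 1:100
chunk_length <- 30  # hypothetical chunk size chosen for illustration

# Dividing each position by the chunk length and rounding up
# gives one chunk number per element: 1, 1, ..., 2, 2, ...
groups <- ceiling(seq_along(x) / chunk_length)

# split() returns a list with one vector per chunk number
chunks <- split(x, groups)

lengths(chunks)  # chunk sizes; the last chunk holds the remainder
```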
To split a data set randomly into two portions, count the rows and take the desired fraction (for example, 80%) of them as the sample size. Next, use the sample function to select that many row indices. Finally, use those indices to split the data set into the two portions.
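The steps above could look roughly like this. This is a sketch only: the data frame df, its columns, and the names n_train, train_rows, train and holdout are hypothetical, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(id = 1:100, value = rnorm(100))  # hypothetical data

n_train <- floor(0.8 * nrow(df))         # 80% of the rows
train_rows <- sample(nrow(df), n_train)  # row indices, drawn without replacement

train   <- df[train_rows, ]   # the sampled rows
holdout <- df[-train_rows, ]  # every row that was not sampled
```

Because the second portion is taken with negative indexing rather than a second call to sample(), no row can end up in both portions.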
To split the data frame in R, use the split() function. You can split a data set into subsets based on one or more variables representing groups of the data.
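For illustration, a small sketch of splitting a data frame by a grouping variable, using a made-up data frame (the column names group and value are assumptions):

```r
df <- data.frame(
  group = c("a", "b", "a", "c", "b"),  # hypothetical grouping variable
  value = c(10, 20, 30, 40, 50)
)

# One data frame per distinct value of df$group, returned as a named list
parts <- split(df, df$group)

names(parts)   # "a" "b" "c"
parts$a$value  # 10 30
```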
Regarding my comment, what is wrong with:
vec <- 1:100
set.seed(2)
samp <- sample(length(vec), 30)
a <- vec[samp]
b <- vec[-samp]
?
To show these are separate sets with no duplicates:
R> intersect(a, b)
integer(0)
If you have duplicate values in your vector that is a different matter, but your question is unclear.
With duplicates in vec, things are a bit more complicated, and it depends on what result you want to achieve.
R> set.seed(4)
R> vec <- sample(100, 100, replace = TRUE)
R> set.seed(6)
R> samp <- sample(100, 30)
R> a <- vec[samp]
R> b <- vec[-samp]
R> length(a)
[1] 30
R> length(b)
[1] 70
R> length(setdiff(vec, a))
[1] 41
So setdiff() "fails" here in that it doesn't get the length right, but that is because a and b contain duplicate values (though not duplicate observations) from the sample:
R> intersect(a, b)
[1] 57 35 91 27 71 63 8 92 49 77
The duplicates (the intersection) arise because the values above occurred more than once in the original sample vec.
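One way to check that the two parts are disjoint as observations, even when values repeat, is to compare positions rather than values. A minimal sketch, assuming positions are what matter (idx_a and idx_b are illustrative names, and the seed is arbitrary):

```r
set.seed(4)
vec <- sample(100, 100, replace = TRUE)  # vector containing repeated values

idx_a <- sample(length(vec), 30)         # positions that go into a
idx_b <- setdiff(seq_along(vec), idx_a)  # all remaining positions

a <- vec[idx_a]
b <- vec[idx_b]

length(intersect(idx_a, idx_b))  # 0: no observation is in both parts
length(a) + length(b)            # 100: every observation is used exactly once
```

Here setdiff() is safe because it is applied to the positions, which are unique, rather than to the values.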
How about this:
t <- 1:100 # or whatever your original set is
a <- sample(t, 70)
b <- setdiff(t, a)
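Note that setdiff() works on unique values, so the approach above relies on the elements of t being distinct. A sketch of an alternative that also handles repeated values, assuming group sizes of 30 and 70 as in the question (x, labels and parts are illustrative names):

```r
x <- c(1:50, 1:50)  # hypothetical vector of length 100 with repeated values

# One label per element: 30 "a" labels and 70 "b" labels, in random order
labels <- sample(rep(c("a", "b"), times = c(30, 70)))

# Each element goes to exactly one group, according to its label
parts <- split(x, labels)

lengths(parts)  # a: 30, b: 70
```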