Split vector randomly into two sets

Tags: random, r, sample, random-sample

I have a vector t of length 100 and want to split it into two subvectors of 30 and 70 values. The values should be chosen randomly and without replacement, so none of the 30 values may appear in the subvector of 70 values, and vice versa.

I know the R function sample, which I can use to randomly choose values from a vector with or without replacement. However, even with replace = FALSE I have to run sample twice, once to choose 30 values and once to choose 70. That means some of the 30 values might also end up among the 70, and vice versa.

Any ideas?


asked Sep 04 '12 10:09

user969113

2 Answers

Regarding my comment, what is wrong with:

vec <- 1:100
set.seed(2)
samp <- sample(length(vec), 30)

a <- vec[samp]
b <- vec[-samp]

To show these are separate sets with no duplicates:

R> intersect(a, b)
integer(0)
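If this comes up often, the same idea could be wrapped in a small helper. This is only a sketch: `split_random` is a hypothetical name (not a base R function), and it assumes you want the split done by position rather than by value. It uses a logical mask for the second part because a negative index built from an empty selection would return nothing instead of everything.

```r
# Hypothetical helper: randomly split x into two disjoint parts by position.
split_random <- function(x, n) {
  stopifnot(n >= 0, n <= length(x))
  idx <- sample.int(length(x), n)        # n distinct positions, no replacement
  rest <- !(seq_along(x) %in% idx)       # logical mask, safe even when n = 0
  list(first = x[idx], second = x[rest])
}

set.seed(2)
parts <- split_random(1:100, 30)
lengths(parts)                           # sizes of the two parts
intersect(parts$first, parts$second)     # empty when x has no repeated values
```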

If you have duplicate values in your vector, that is a different matter, but your question is unclear on this point.

With duplicates in vec, things are a bit more complicated, and the right approach depends on the result you want to achieve.

R> set.seed(4)
R> vec <- sample(100, 100, replace = TRUE)
R> set.seed(6)
R> samp <- sample(100, 30)
R> a <- vec[samp]
R> b <- vec[-samp]
R> length(a)
[1] 30
R> length(b)
[1] 70
R> length(setdiff(vec, a))
[1] 41

So setdiff() "fails" here in the sense that it doesn't return 70 values: it works on unique values, not on observations. Meanwhile, a and b do share some values (though not observations from the sample!):

R> intersect(a, b)
 [1] 57 35 91 27 71 63  8 92 49 77

The duplicates (the intersection) arise because the values above occurred more than once in the original sample vec.
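As a quick sanity check on that reasoning, something like the following should confirm that splitting by position puts every observation into exactly one part, even when values repeat. It reuses the names from the session above; the variable `shared` is only introduced here for illustration.

```r
set.seed(4)
vec <- sample(100, 100, replace = TRUE)  # values may repeat
set.seed(6)
samp <- sample(length(vec), 30)          # sample positions, not values
a <- vec[samp]
b <- vec[-samp]

# Taken together, a and b contain exactly the observations of vec
identical(sort(c(a, b)), sort(vec))      # TRUE

# Any value found in both parts must occur more than once in vec
shared <- intersect(a, b)
all(shared %in% vec[duplicated(vec)])    # TRUE
```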

answered Sep 19 '22 16:09

Gavin Simpson

How about this:

t <- 1:100 # or whatever your original set is
a <- sample(t, 70)
b <- setdiff(t, a)
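One caveat, as far as I can tell: `setdiff()` works on unique values, so this only gives a clean 30/70 split when the original vector has no repeated values. A small sketch of the difference, using a made-up vector `x` in which every value appears twice (the names `a2`, `b2` and `idx` are just for illustration):

```r
x <- rep(1:5, each = 2)          # 10 observations, every value appears twice

set.seed(1)
a <- sample(x, 7)
b <- setdiff(x, a)               # unique values of x that are absent from a
length(a) + length(b) < length(x)    # TRUE: some observations are lost

# Splitting by position keeps every observation exactly once
idx <- sample(length(x), 7)
a2 <- x[idx]
b2 <- x[-idx]
length(a2) + length(b2) == length(x) # TRUE
```

With a vector of distinct values, such as 1:100, both approaches give two disjoint sets.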


answered Sep 19 '22 16:09

seancarmody
