
sample with replacement but constrain the max frequency of each member to be drawn

Tags: random, r, sample

Is it possible to extend the sample function in R to not return more than say 2 of the same element when replace = TRUE?

Suppose I have a list:

l = c(1,1,2,3,4,5)

To sample 3 elements with replacement, I would do:

sample(l, 3, replace = TRUE)

Is there a way to constrain its output so that only a maximum of 2 of the same elements are returned? So (1,1,2) or (1,3,3) is allowed, but (1,1,1) or (3,3,3) is excluded?

Thomas Moore asked Sep 30 '18 22:09



1 Answer

set.seed(0)

The basic idea is to convert sampling with replacement to sampling without replacement.

ll <- unique(l)          ## unique values
#[1] 1 2 3 4 5
pool <- rep.int(ll, 2)   ## replicate each unique so they each appear twice
#[1] 1 2 3 4 5 1 2 3 4 5
sample(pool, 3)          ## draw 3 samples without replacement
#[1] 4 3 5

## replicate it a few times
## each column is one sample after "simplification" by `replicate`
replicate(5, sample(pool, 3))
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    4    2    2    3
#[2,]    4    5    1    2    5
#[3,]    2    1    2    4    1
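
As a quick sanity check of the idea above, the sketch below repeats the draw many times and confirms that no value ever appears more than twice within a sample. The seed and the number of repetitions (1000) are arbitrary choices made for this illustration, not part of the original answer.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
l    <- c(1, 1, 2, 3, 4, 5)
pool <- rep.int(unique(l), 2)      # each unique value is available at most twice

## 3 x 1000 matrix: one sample of size 3 per column
draws <- replicate(1000, sample(pool, 3))

## largest frequency of any single value within each column
max_count <- apply(draws, 2L, function(col) max(table(col)))
all(max_count <= 2)
#[1] TRUE
```

The check cannot fail, because the pool itself contains only two copies of each value.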

If you want different values to appear up to different numbers of times, you can do, for example:

pool <- rep.int(ll, c(2, 3, 3, 4, 1))
#[1] 1 1 2 2 2 3 3 3 4 4 4 4 5

## draw 9 samples; replicate 5 times
oo <- replicate(5, sample(pool, 9))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    5    1    4    3    2
# [2,]    2    2    4    4    1
# [3,]    4    4    1    1    1
# [4,]    4    2    3    2    5
# [5,]    1    4    2    5    2
# [6,]    3    4    3    3    3
# [7,]    1    4    2    2    2
# [8,]    4    1    4    3    3
# [9,]    3    3    2    2    4

We can call tabulate on each column to count the frequency of 1, 2, 3, 4, 5:

## set `nbins` in `tabulate` so frequency table of each column has the same length
apply(oo, 2L, tabulate, nbins = 5)
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    2    2    1    1    2
#[2,]    1    2    3    3    3
#[3,]    2    1    2    3    2
#[4,]    3    4    3    1    1
#[5,]    1    0    0    1    1

The counts in every column respect the frequency upper bounds c(2, 3, 3, 4, 1) that we set.
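
The steps above can be wrapped in a small helper. This is only a sketch: `sample_capped` and its `max_freq` argument are hypothetical names introduced here, not base R functions. It assumes the caps are given either as a single number or as one number per unique value, in the order returned by `unique()`.

```r
## Hypothetical helper (not part of base R): draw `size` values from the
## unique elements of `x`, allowing each value at most `max_freq` times.
sample_capped <- function(x, size, max_freq = 2L) {
  ux <- unique(x)
  ## recycle a single cap to all unique values
  max_freq <- rep_len(as.integer(max_freq), length(ux))
  ## build the pool: value i appears max_freq[i] times
  pool <- rep.int(ux, max_freq)
  if (size > length(pool)) {
    stop("`size` exceeds the total number of draws allowed by `max_freq`")
  }
  ## index with sample.int() so a length-one pool is not treated as 1:pool
  pool[sample.int(length(pool), size)]
}

l <- c(1, 1, 2, 3, 4, 5)
set.seed(0)
sample_capped(l, 3)                                   # at most 2 of any value
draw <- sample_capped(l, 9, max_freq = c(2, 3, 3, 4, 1))
all(tabulate(draw, nbins = 5) <= c(2, 3, 3, 4, 1))
#[1] TRUE
```

Because the pool is built from `unique(x)`, the duplicated 1 in `l` gets no extra weight; every value's chance of being drawn depends only on its cap.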


Would you explain the difference between rep and rep.int?

rep.int is not the "integer" method for rep. It is just a faster primitive function with less functionality than rep. You can get more details of rep, rep.int and rep_len from the doc page ?rep.
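
A short illustration of the difference in functionality, using a throwaway vector `x` chosen for this example: the plain repetition gives the same result from both functions, while arguments such as `each` exist only in `rep`.

```r
x <- c(1, 2, 3)

a <- rep(x, times = 2)            # 1 2 3 1 2 3
b <- rep.int(x, times = 2)        # 1 2 3 1 2 3, same result for this basic use
identical(a, b)
#[1] TRUE

d <- rep.int(x, times = c(2, 1, 3))   # per-element counts: 1 1 2 3 3 3
e <- rep(x, each = 2)                 # 1 1 2 2 3 3; `each` is only in rep()
f <- rep_len(x, length.out = 5)       # 1 2 3 1 2
```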

Zheyuan Li answered Sep 20 '22 00:09