Consider a character vector, `pool`, whose elements are (zero-padded) binary numbers with up to `max_len` digits.
max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
pool
## [1] "0" "1" "00" "10" "01" "11" "000" "100" "010" "110"
## [11] "001" "101" "011" "111" "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"
I'd like to sample `n` of these elements, with the constraint that none of the sampled elements is a prefix of any other sampled element (i.e. if we sample `1101`, we ban `1`, `11`, and `110`, while if we sample `1`, we ban those elements beginning with `1`, such as `10`, `11`, `100`, etc.).
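For reference, here's a quick way to verify that a candidate sample satisfies this constraint (this helper is just for checking, not part of my attempt below):
is_prefix_free <- function(x)
  !any(outer(x, x, function(a, b) a != b & startsWith(a, b)))
is_prefix_free(c("0100", "11", "000"))  # TRUE
is_prefix_free(c("1", "10", "00"))      # FALSE: "1" prefixes "10"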
The following is my attempt using `while`, but of course this is slow when `n` is large (or when it approaches `2^max_len`).
set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
  prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
  pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
    paste0('^', chosen[!prefixes]), pool)) == 0]
  chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}
chosen
## [1] "0100" "0101" "0001" "0011" "1000" "111" "0000" "0110" "1100" "0111"
This can be improved slightly by initially removing from `pool` those elements whose inclusion would mean there are insufficient elements left in `pool` to take a total sample of size `n`. E.g., when `max_len = 4` and `n > 9`, we can immediately remove `0` and `1` from `pool`, since by including either, the maximum sample size would be 9 (either `0` and the eight 4-character elements beginning with `1`, or `1` and the eight 4-character elements beginning with `0`).
Based on this logic, we could omit elements from `pool`, prior to taking the initial sample, with something like:
pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]
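For a concrete check of that rule (my arithmetic, not part of the original code): keeping an element of length k caps the achievable sample size at 2^max_len - 2^(max_len - k) + 1, since the element replaces all 2^(max_len - k) of its max_len-character extensions.
max_len <- 4
k <- seq_len(max_len)                    # element length
caps <- 2^max_len - 2^(max_len - k) + 1  # max sample size if a length-k element is kept
caps
# [1]  9 13 15 16
identical(caps, 2^max_len - rev(2^(0:max_len))[-1] + 1)  # TRUE: same rule as above
So with n = 10, only lengths whose cap is at least 10 survive, which removes exactly the single-character elements 0 and 1.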
Can anyone think of a better approach? I feel like I'm overlooking something much simpler.
EDIT
To clarify my intentions, I'll portray the pool as a set of branches, where the junctions and tips are the nodes (elements of `pool`). Assume the yellow node in the following figure (i.e. 010) was drawn. Now, the entire red "branch", which consists of nodes 0, 01, and 010, is removed from the pool. This is what I meant by forbidding sampling of nodes that "prefix" nodes already in our sample (as well as those already prefixed by nodes in our sample).
If the sampled node was half-way along a branch, such as 01 in the following figure, then all red nodes (0, 01, 010, and 011) are disallowed, since 0 prefixes 01, and 01 prefixes both 010 and 011.
I don't mean to sample either 1 or 0 at each junction (i.e. walking along the branches flipping coins at forks) - it's fine to have both in the sample, as long as: (1) parents (or grandparents, etc.) or children (grandchildren, etc.) of the node aren't already sampled; and (2) upon sampling the node there will be sufficient nodes remaining to achieve the desired sample of size `n`.
In the second figure above, if 01 was the first pick, all black nodes are still (currently) valid, assuming `n <= 4`. For example, if `n == 4` and we sampled node 1 next (so our picks now included 01 and 1), we would subsequently disallow node 00 (due to rule 2 above) but could still pick 000 and 001, giving us our 4-element sample. If `n == 5`, on the other hand, node 1 would be disallowed at this stage.
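To make rule (1) concrete, here's a small helper of mine (`still_valid` is a hypothetical name, not part of the question) that lists the nodes still pickable under rule (1) alone, reproducing the example above:
still_valid <- function(pool, picks) {
  banned <- vapply(pool, function(p)
    any(startsWith(p, picks) | startsWith(picks, p)), logical(1))
  pool[!banned]
}
pool3 <- unlist(lapply(1:3, function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
still_valid(pool3, c("01", "1"))
# [1] "00"  "000" "001"  -- rule (2) would additionally drop "00" when n == 4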
You can sort the pool to help decide what elements to disqualify. For example, looking at a sorted pool built with `max_len` of 3:
[1] "0" "00" "000" "001" "01" "010" "011" "1" "10" "100" "101" "11"
[13] "110" "111"
I can tell that I can disqualify anything that follows my selected item and has more characters than my item, up to the first item that has the same number of characters or fewer. For example, if I select "01", I can immediately see that the next two items ("010", "011") need to be removed, but not the one after that, because "1" has fewer characters. Removing the "0" afterwards is easy. Here is an implementation:
library(fastmatch)  # could use `match`, but we repeatedly search against the same hash

# `pool` must be sorted!
sample01 <- function(pool, n) {
  picked <- logical(length(pool))
  chrs <- nchar(pool)
  pick.list <- character(n)
  pool.seq <- seq_along(pool)

  for(i in seq(n)) {
    # Make sure pool not exhausted
    left <- which(!picked)
    left.len <- length(left)
    if(!length(left)) break

    # Sample from pool
    seq.left <- seq.int(left.len)
    pool.left <- pool[left]
    chrs.left <- chrs[left]
    pick <- sample(length(pool.left), 1L)

    # Find all the elements with more characters that are disqualified
    # and store their indices in `valid` (bad name...)
    valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick
    first.invalid <- which(!valid.tmp & seq.left > pick)
    valid <- if(length(first.invalid)) {
      pick:(first.invalid[[1L]] - 1L)
    } else pick:left.len

    # Translate back to original pool indices since we're working on a
    # subset in `pool.left`
    pool.seq.left <- pool.seq[left]
    pool.idx <- pool.seq.left[valid]
    val <- pool[[pool.idx[[1L]]]]

    # Record the picked value, and all the disqualifications
    pick.list[[i]] <- val
    picked[pool.idx] <- TRUE

    # Disqualify shorter matches
    to.rem <- vapply(
      seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L
    )
    to.rem.idx <- fmatch(to.rem, pool, nomatch=0)
    picked[to.rem.idx] <- TRUE
  }
  pick.list
}
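As a quick sanity check of the contiguous-block idea above (my addition, not part of the answer), picking "01" in a sorted `max_len` 3 pool disqualifies exactly the run of longer elements that follows it:
pool <- sort(unlist(lapply(1:3, function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))))
i <- match("01", pool)
pool[(i + 1):(i + 2)]  # "010" "011": longer, so disqualified
pool[i + 3]            # "1": fewer characters, so the block ends here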
And a function to make sorted pools (exactly like your code, but returns sorted):
make_pool <- function(size)
  sort(
    unlist(
      lapply(
        seq_len(size),
        function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))
  ) ) )
Then, using the `max_len` 3 pool (useful for visually inspecting that things behave as expected):
pool3 <- make_pool(3)
set.seed(1)
sample01(pool3, 8)
# [1] "001" "1" "010" "011" "000" "" "" ""
sample01(pool3, 8)
# [1] "110" "111" "011" "10" "00" "" "" ""
sample01(pool3, 8)
# [1] "000" "01" "11" "10" "001" "" "" ""
sample01(pool3, 8)
# [1] "011" "101" "111" "001" "110" "100" "000" "010"
Notice in the last case we get all 3-digit binary combinations (2 ^ 3) because by chance we kept sampling from the 3-digit ones. Also, with just a `max_len` 3 pool there are many samplings that prevent a full 8 draws; you could address this with your suggestion of eliminating the combinations that prevent full draws from the pool.
This is pretty fast. Looking at the `max_len == 9` example that took 2 seconds with the alternate solution:
library(microbenchmark)
pool9 <- make_pool(9)
microbenchmark(sample01(pool9, 4))
# Unit: microseconds
# expr min lq median uq max neval
# sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663 100
About half a millisecond. You can reasonably attempt fairly large pools too:
pool16 <- make_pool(16) # 131K entries
system.time(sample01(pool16, 100))
# user system elapsed
# 3.407 0.146 3.552
This is not very fast, but we're talking about a pool with 130K items. There is also potential for additional optimization.
Note the sorting step gets to be relatively slow for large pools, but I'm not counting it because you only ever need to do it once, and you could probably come up with a reasonable algorithm to produce the pool pre-sorted.
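For instance, one way to emit the pool already sorted is a depth-first walk that visits each prefix before its extensions. This is my own sketch (`make_pool_sorted` is a hypothetical name, not from the answer):
make_pool_sorted <- function(max_len, prefix = "") {
  if (nchar(prefix) == max_len) return(character(0))
  c0 <- paste0(prefix, "0")
  c1 <- paste0(prefix, "1")
  # pre-order traversal: each prefix is emitted before its extensions,
  # which matches lexicographic order for these strings
  c(c0, make_pool_sorted(max_len, c0), c1, make_pool_sorted(max_len, c1))
}
identical(make_pool_sorted(3), make_pool(3))  # should be TRUE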
There is also the possibility of a much faster integer to binary based approach that I explored in a now deleted answer, but that requires a fair bit more work to tie out to exactly what you're looking for.
I found the problem interesting, so I tried it and got this with really low skills in R (so this may be improved):
Newer edited version, thanks to @Franck's advice:
library(microbenchmark)
library(lineprof)
library(stringr)

max_len <- 16
pool <- unlist(lapply(seq_len(max_len), function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
n <- 100

tree_sample <- function(samples, pool) {
  results <- vector("character", samples)
  # Will be used on a regular basis, compute it in advance
  PoolLen <- str_length(pool)
  # Make a mask vector based on the length of each entry of the pool
  masks <- strtoi(str_pad(str_pad("1", PoolLen, "right", "1"), max_len, "right", "0"), base = 2)
  # Make an integer vector from the "0" right-padded original: for max_len = 4 and
  # pool entry "1" we get "1000" => 8. This allows us to find this entry as the
  # parent of 10 and 11, which become "1000" and "1100" (integers 8 and 12):
  # once bitwise-ANDed with the respective mask "1000", the first bit is strictly
  # the same, so it's a parent.
  integerPool <- strtoi(str_pad(pool, max_len, "right", "0"), base = 2)
  # Create a vector to filter the values available for sampling
  ok <- rep(TRUE, length(pool))
  # Precompute the result of the bitwise AND between our integer pool and the masks
  MaskedPool <- bitwAnd(integerPool, masks)

  while (samples) {
    samp <- sample(pool[ok], 1)  # Get a sample
    results[samples] <- samp     # Store it as a result
    ok[pool == samp] <- FALSE    # Remove it from the available entries
    # Get the integer value of the "0" right-padded sample
    vsamp <- strtoi(str_pad(samp, max_len, "right", "0"), base = 2)
    mlen <- str_length(samp)     # Get the sample length
    # Create a unitary mask to remove the children of the sample
    mask <- strtoi(paste0(rep(1:0, c(mlen, max_len - mlen)), collapse = ""), base = 2)
    # Bitwise AND between the integerPool and the sample's mask
    FilterVec <- bitwAnd(integerPool, mask)
    # Bitwise AND of the sample and its mask
    Childm <- bitwAnd(vsamp, mask)
    # Remove the sample's children from the available entries
    ok[FilterVec == Childm] <- FALSE
    # Compare the sample against all the masks to remove matching parents
    ok[MaskedPool == bitwAnd(vsamp, masks)] <- FALSE
    samples <- samples - 1
  }
  print(results)
}

microbenchmark(tree_sample(n, pool), times = 10L)
The main idea is to use bitmask comparison to know whether one sample is a parent of another (common bit part); if so, suppress that element from the pool.
It now takes 1.4 s to draw 100 samples from a pool built with `max_len` of 16 on my machine.
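To make the mask test concrete, here is a tiny worked example (mine, assuming max_len = 4): the sample "110" right-padded is "1100" (integer 12), while pool entry "1" padded is "1000" (integer 8) with mask "1000" (8). Equal values after masking mean "1" is a parent (prefix) of "110":
bitwAnd(strtoi("1100", base = 2), strtoi("1000", base = 2))  # 8
bitwAnd(strtoi("1000", base = 2), strtoi("1000", base = 2))  # 8 -> match, so "1" gets removed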
It is in Python and not in R, but jbaums said it would be okay.
So here is my contribution; see the comments in the source for an explanation of the crucial parts.
I'm still working on the analytical solution to determine the number of possible combinations `c` for a tree of depth `t` and `S` samples, so that I can improve the function `combs`. Maybe someone has it? This is really the bottleneck now.
Sampling 100 nodes from a tree with depth 16 takes around 8 ms on my laptop. Not on the first run, but it gets faster up to a certain point the more you sample, because `_combBuffer` gets filled.
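Since `combs` is the heart of the approach, here is the same memoised recurrence transcribed to R (my own sketch, for readers following along in the thread's main language; the Python original follows below):
comb_buffer <- new.env()
# number of prefix-free samples of size n from a complete binary tree of depth d
combs <- function(d, n) {
  leaves <- 2^d
  if (n < 0 || n > leaves) return(0)
  if (n == 0 || d == 0 || n == leaves) return(1)
  if (n == 1) return(2 * leaves - 1)   # any single node of the tree works
  key <- paste(d, n)
  if (exists(key, envir = comb_buffer, inherits = FALSE))
    return(get(key, envir = comb_buffer))
  nMin <- max(0, n - leaves / 2)       # each side can host at most leaves/2 samples
  total <- 0
  for (nLeft in nMin:(n - nMin))       # sum over all feasible left/right splits
    total <- total + combs(d - 1, nLeft) * combs(d - 1, n - nLeft)
  assign(key, total, envir = comb_buffer)
  total
}
combs(3, 6)   # 58, matching the docstring example below
combs(4, 16)  # 1: the only option is to take all 16 leaves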
import random


class Tree(object):
    """
    :param level: The distance of this node from the root.
    :type level: int
    :param parent: This tree's parent node
    :type parent: Tree
    :param isleft: Determines if this is a left or a right child node. Can be
                   omitted if this is the root node.
    :type isleft: bool

    A binary tree representing possible strings which match r'[01]{1,n}'. Its
    purpose is to be able to sample n of its nodes where none of the sampled
    nodes' ids is a prefix for another one.
    It is possible to change Tree.maxdepth and then reuse the root. All
    children are created ON DEMAND, which means everything is lazily evaluated.
    If the Tree gets too big anyway, you can call 'prune' on any node to delete
    its children.

    >>> t = Tree()
    >>> t.sample(8, toString=True, depth=3)
    ['111', '110', '101', '100', '011', '010', '001', '000']
    >>> Tree.maxdepth = 2
    >>> t.sample(4, toString=True)
    ['11', '10', '01', '00']
    """

    maxdepth = 10
    _combBuffer = {}

    def __init__(self, level=0, parent=None, isleft=None):
        self.parent = parent
        self.level = level
        self.isleft = isleft
        self._left = None
        self._right = None

    @classmethod
    def setMaxdepth(cls, depth):
        """
        :param depth: The new depth
        :type depth: int

        Sets the maxdepth of the Trees. This basically is the depth of the
        root node.
        """
        if cls.maxdepth == depth:
            return
        cls.maxdepth = depth

    @property
    def left(self):
        """This tree's left child, 'None' if this is a leaf node"""
        if self.depth == 0:
            return None
        if self._left is None:
            self._left = Tree(self.level+1, self, True)
        return self._left

    @property
    def right(self):
        """This tree's right child, 'None' if this is a leaf node"""
        if self.depth == 0:
            return None
        if self._right is None:
            self._right = Tree(self.level+1, self, False)
        return self._right

    @property
    def depth(self):
        """
        This tree's depth. (maxdepth-level)
        """
        return self.maxdepth-self.level

    @property
    def id(self):
        """
        This tree's id, a string of '0's and '1's equal to the path from the
        root to this subtree, where '1' means going left and '0' means going
        right.
        """
        # level 0 is the root node, it has no id
        if self.level == 0:
            return ''
        # This takes at most Tree.maxdepth recursions, therefore it is safe
        # to do it this way. We could also save each node's id once it is
        # created to avoid recreating it every time; however, this won't save
        # much time but would use quite some space.
        return self.parent.id + ('1' if self.isleft else '0')

    @property
    def leaves(self):
        """
        The number of leaf nodes this tree has. (2**depth)
        """
        return 2**self.depth

    def __str__(self):
        return self.id

    def __len__(self):
        return 2*self.leaves-1

    def prune(self):
        """
        Recursively prune this tree's children.
        """
        if self._left is not None:
            self._left.prune()
            self._left.parent = None
            self._left = None
        if self._right is not None:
            self._right.prune()
            self._right.parent = None
            self._right = None

    def combs(self, n):
        """
        :param n: The number of samples to be taken from this tree
        :type n: int
        :returns: The number of possible combinations to choose n samples
                  from this tree

        Recursively determines the number of combinations of n nodes to be
        sampled from this tree.
        Subsequent calls with the same n on trees with the same depth will
        return the result from the previous computation rather than computing
        it again.

        >>> t = Tree()
        >>> Tree.maxdepth = 4
        >>> t.combs(16)
        1
        >>> Tree.maxdepth = 3
        >>> t.combs(6)
        58
        """
        # all that matters for the number of combinations is n and the depth
        # of this tree
        key = (self.depth, n)

        # We use the dict to save computation time. Calling the function with
        # equal values on equal nodes just returns the already computed value
        # if possible.
        if key not in Tree._combBuffer:
            leaves = self.leaves

            if n < 0:
                N = 0
            elif n == 0 or self.depth == 0 or n == leaves:
                N = 1
            elif n == 1:
                return (2*leaves-1)
            else:
                if n > leaves/2:
                    # if n > leaves/2, at least n-leaves/2 samples have to
                    # stay on either side, otherwise the other side would
                    # have to sample more nodes than possible.
                    nMin = n-leaves/2
                else:
                    nMin = 0

                # The rest, n-2*nMin, is the number of samples that are free
                # to fall on either side
                free = n-2*nMin

                N = 0
                # sum up the combinations of all possible splits
                for addLeft in range(0, free+1):
                    nLeft = nMin + addLeft
                    nRight = n - nLeft
                    N += self.left.combs(nLeft)*self.right.combs(nRight)

            Tree._combBuffer[key] = N
            return N
        return Tree._combBuffer[key]

    def sample(self, n, toString=False, depth=None):
        """
        :param n: How many samples to take from this tree
        :type n: int
        :param toString: If 'True' the result will directly be turned into a
                         list of strings
        :type toString: bool
        :param depth: If not None, will overwrite Tree.maxdepth
        :type depth: int or None
        :returns: List of n nodes sampled from this tree
        :throws ValueError: when n is invalid

        Takes n random samples from this tree where none of the samples' ids
        is a prefix for another one's.

        For an example see Tree's docstring.
        """
        if depth is not None:
            Tree.setMaxdepth(depth)

        if toString:
            return [str(e) for e in self.sample(n)]

        if n < 0:
            raise ValueError('Negative sample size is not possible!')

        if n == 0:
            return []

        leaves = self.leaves
        if n > leaves:
            raise ValueError(('Cannot sample {} nodes, with only {} ' +
                              'leaves!').format(n, leaves))

        # Only one sample to choose, that is nice! We are free to take any
        # node from this tree, including this very node itself.
        if n == 1 and self.level > 0:
            # This tree has 2*leaves-1 nodes, therefore the probability that
            # we keep the root node has to be 1/(2*leaves-1) = P_root. Let's
            # create a random number from the interval [0, 2*leaves-1).
            # It will be 0 with probability 1/(2*leaves-1)
            P_root = random.randint(0, len(self)-1)
            if P_root == 0:
                return [self]
            else:
                # The probability to land here is 1-P_root
                # A child tree's size is (leaves-1) and since it obeys the
                # same rule as above, the probability for each of its nodes
                # to 'survive' is 1/(leaves-1) = P_child.
                # However all nodes must have equal probability, therefore to
                # make sure that their probability is also P_root we multiply
                # them by 1/2*(1-P_root). The latter is already done, the
                # former will be achieved by the next condition.
                # If we do everything right, this should hold:
                # 1/2 * (1-P_root) * P_child = P_root
                # Let's see...
                # 1/2 * (1-1/(2*leaves-1)) * (1/leaves-1)
                # (1-1/(2*leaves-1)) * (1/(2*(leaves-1)))
                # (1-1/(2*leaves-1)) * (1/(2*leaves-2))
                # (1/(2*leaves-2)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-1)/((2*leaves-2) * (2*leaves-1)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-2)/((2*leaves-2) * (2*leaves-1))
                # 1/(2*leaves-1)
                # There we go!
                if random.random() < 0.5:
                    return self.right.sample(1)
                else:
                    return self.left.sample(1)

        # Now comes the tricky part... n > 1, therefore we are NOT going to
        # sample this node. Its probability to be chosen is 0!
        # It HAS to be 0 since we are definitely sampling from one of its
        # children, which means that this node will be blocked by those
        # samples. The difficult part now is to prove that the sampling the
        # way we do it is really random.

        if n > leaves/2:
            # if n > leaves/2, at least n-leaves/2 samples have to stay on
            # either side, otherwise the other side would have to sample more
            # nodes than possible.
            nMin = n-leaves/2
        else:
            nMin = 0
        # The rest, n-2*nMin, is the number of samples that are free to fall
        # on either side
        free = n-2*nMin

        # Let's have a look at an example, suppose we were to distribute 5
        # samples among two children which have 4 leaves each.
        # Each child has to get at least 1 sample, so the free samples are 3.
        # There are 4 different ways to split the samples among the
        # children (left, right):
        # (1, 4), (2, 3), (3, 2), (4, 1)
        # The number of unique sample combinations per child is
        # (7, 1), (11, 6), (6, 11), (1, 7)
        # The number of total unique samples per possible split is
        #   7   ,   66   ,   66   ,   7
        # In case of the first and last split, all samples have a probability
        # of 1/7, this was already proven above.
        # Let's suppose we are good to go and the per-sample probabilities
        # for the other two cases are (1/11, 1/6) and (1/6, 1/11); this way
        # the overall per-sample probabilities for the splits would be:
        #  1/7  ,  1/66  ,  1/66  ,  1/7
        # If we used a uniform random draw to determine the split, all splits
        # would be equally probable and therefore be multiplied with the same
        # value (1/4). But this would mean that NOT every sample is equally
        # probable!
        # We need to know in advance how many sample combinations there will
        # be for a given split in order to find out the probability to choose
        # it. In fact, due to the restrictions, this becomes very nasty to
        # determine. So instead of solving it analytically, I do it
        # numerically with the method 'combs'. It gives me the number of
        # possible sample combinations for a certain number of samples and a
        # given tree depth. It will return 146 for this node and 7 for the
        # outer and 66 for the inner splits.
        # What we now do is, we take a number from [0, 146).
        # If it is smaller than 7, we sample from the first split,
        # if it is smaller than 7+66, we sample from the second split,
        # ...
        # This way we get the probabilities we need.
        r = random.randint(0, self.combs(n)-1)
        p = 0
        for addLeft in xrange(0, free+1):
            nLeft = nMin + addLeft
            nRight = n - nLeft
            p += (self.left.combs(nLeft) * self.right.combs(nRight))
            if r < p:
                return self.left.sample(nLeft) + self.right.sample(nRight)
        assert False, ('Something really strange happened, p did not sum up ' +
                       'to combs or r was too big')


def main():
    """
    Do a microbenchmark.
    """
    import timeit
    i = 1
    main.t = Tree()
    template = ' {:>2} {:>5} {:>4} {:<5}'
    print(template.format('i', 'depth', 'n', 'time (ms)'))
    N = 100
    for depth in [4, 8, 15, 16, 17, 18]:
        for n in [10, 50, 100, 150]:
            if n > 2**depth:
                time = '--'
            else:
                time = timeit.timeit(
                    'main.t.sample({}, depth={})'.format(n, depth), setup=
                    'from __main__ import main', number=N)*1000./N
            print(template.format(i, depth, n, time))
            i += 1


if __name__ == "__main__":
    main()
The benchmark output:
  i depth    n time (ms)
  1     4   10 0.182511806488
  2     4   50 --
  3     4  100 --
  4     4  150 --
  5     8   10 0.397620201111
  6     8   50 1.66054964066
  7     8  100 2.90236949921
  8     8  150 3.48146915436
  9    15   10 0.804011821747
 10    15   50 3.7428188324
 11    15  100 7.34910964966
 12    15  150 10.8230614662
 13    16   10 0.804491043091
 14    16   50 3.66818904877
 15    16  100 7.09567070007
 16    16  150 10.404779911
 17    17   10 0.865840911865
 18    17   50 3.9999294281
 19    17  100 7.70257949829
 20    17  150 11.3758206367
 21    18   10 0.915451049805
 22    18   50 4.22935962677
 23    18  100 8.22361946106
 24    18  150 12.2081303596
10 samples of size 10 from a tree with depth 10:
['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010']
['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111']
['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101']
['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110']
['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110']
['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101']
['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111']
['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111']
['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100']
['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000']
Mapping ids to strings. You can map numbers to your 0/1 vectors, as @BrodieG mentioned:
# some key objects
n_pool    = sum(2^(1:max_len))        # total number of indices
cuts      = cumsum(2^(1:max_len - 1)) # new group starts
inds_by_g = mapply(seq, cuts, cuts*2) # indices grouped by length

# the mapping to strings (one among many possibilities)
library(data.table)
get_01str <- function(id, max_len){
  cuts = cumsum(2^(1:max_len - 1))
  g    = findInterval(id, cuts)
  gid  = id - cuts[g] + 1

  data.table(g, gid)[, s :=
    do.call(paste, c(list(sep=""), lapply(
      seq(g[1]),
      function(x) (gid-1) %/% 2^(x-1) %% 2
    )))
  , by=g]$s
}
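A quick spot-check of the mapping (my addition): the first few ids should reproduce the question's pool ordering within each length group:
get_01str(1:6, 4)
# [1] "0"  "1"  "00" "10" "01" "11"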
Finding ids to drop. We'll sequentially drop `id`s from the sampling pool:
# the mapping from one index to indices of nixed strings
get_nixstrs <- function(g, gid, max_len){
  cuts       = cumsum(2^(1:max_len - 1))
  gids_child = {
    x = gid %% 2^sequence(g-1)
    ifelse(x, x, 2^sequence(g-1))
  }
  ids_child = gids_child + cuts[sequence(g-1)] - 1

  ids_parent = if (g == max_len) gid + cuts[g] - 1 else {
    gids_par      = vector(mode="list", max_len)
    gids_par[[g]] = gid
    for (gg in seq(g, max_len-1))
      gids_par[[gg+1]] = c(gids_par[[gg]], gids_par[[gg]] + 2^gg)
    unlist(mapply(`+`, gids_par, cuts - 1))
  }

  c(ids_child, ids_parent)
}
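A spot-check (my addition): nixing around "01" (g = 2, gid = 3, max_len = 4) should drop its prefix "0", the string itself, and every extension of it:
get_01str(get_nixstrs(2, 3, 4), 4)
# [1] "0"    "01"   "010"  "011"  "0100" "0110" "0101" "0111"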
The indices are grouped by `g`, the number of characters, `nchar(get_01str(id))`. Because the indices are sorted by `g`, `g = findInterval(id, cuts)` is a faster route.
An index in group `g`, `1 < g < max_len`, has one "child" index of size `g-1` and two parent indices of size `g+1`. For each child node, we take its child node until we hit `g==1`; and for each parent node, we take their pair of parent nodes until we hit `g==max_len`.
The structure of the tree is simplest in terms of the identifier within the group, `gid`. `gid` maps to the two parents, `gid` and `gid+2^g`; reversing this mapping, one finds the child.
Sampling
drawem <- function(n, max_len){
  cuts      = cumsum(2^(1:max_len - 1))
  inds_by_g = mapply(seq, cuts, cuts*2)

  oklens = (1:max_len)[n <= 2^max_len*(1 - 2^(-(1:max_len))) + 1]
  okinds = unlist(inds_by_g[oklens])

  mysamp = rep(0, n)
  for (i in 1:n){
    id    = if (length(okinds) == 1) okinds else sample(okinds, 1)
    g     = findInterval(id, cuts)
    gid   = id - cuts[g] + 1
    nixed = get_nixstrs(g, gid, max_len)
    # print(id); print(okinds); print(nixed)

    mysamp[i] = id
    okinds    = setdiff(okinds, nixed)
    if (!length(okinds)) break
  }

  res <- rep("", n)
  res[seq.int(i)] <- get_01str(mysamp[seq.int(i)], max_len)
  res
}
The `oklens` part integrates the OP's idea of omitting strings that are guaranteed to make the sampling impossible. However, even doing this, we may follow a sampling path that leaves us with no more options. Taking the OP's example of `max_len=4` and `n=10`, we know we must drop `0` and `1` from consideration, but what happens if our first four draws are `00`, `01`, `11` and `10`? Oh well, I guess we are out of luck. This is why you should actually define sampling probabilities. (The OP has another idea, for determining, at each step, which nodes will lead to an impossible state, but that seems a tall order.)
Illustration
# how the indices line up
n_pool = sum(2^(1:max_len))
pdt <- data.table(id=1:n_pool)
pdt[,g:=findInterval(id,cuts)]
pdt[,gid:=1:.N,by=g]
pdt[,s:=get_01str(id,max_len)]
# example run
set.seed(4); drawem(5,5)
# [1] "01100" "1" "0001" "0101" "00101"
set.seed(4); drawem(8,4)
# [1] "1100" "0" "111" "101" "1101" "100" "" ""
Benchmarks (older than those in @BrodieG's answer)
require(rbenchmark)
max_len = 8
n = 8

benchmark(
  jos_lp = {
    pool <- unlist(lapply(seq_len(max_len),
      function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
    sample.lp(pool, n)},
  bro_string = {pool <- make_pool(max_len); sample01(pool, n)},
  fra_num    = drawem(n, max_len),
  replications = 5)[1:5]
#         test replications elapsed relative user.self
# 2 bro_string            5    0.05      2.5      0.05
# 3    fra_num            5    0.02      1.0      0.02
# 1     jos_lp            5    1.56     78.0      1.55

n = 12
max_len = 12
benchmark(
  bro_string = {pool <- make_pool(max_len); sample01(pool, n)},
  fra_num    = drawem(n, max_len),
  replications = 5)[1:5]
#         test replications elapsed relative user.self
# 1 bro_string            5    0.54     6.75      0.51
# 2    fra_num            5    0.08     1.00      0.08
Other answers. There are two other answers:
jos_enum = {pool <- unlist(lapply(seq_len(max_len),
    function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
  get.template(pool, n)}
bro_num  = sample011(max_len, n)
I left out @josilber's enumeration method because it took far too long; and @BrodieG's numeric/index method because it didn't work at the time, but now does. See @BrodieG's updated answer for more benchmarking.
Speed vs correctness. While @josilber's answers are far slower (and, for the enumeration method, apparently much more memory-intensive), they are guaranteed to draw a sample of size `n` on the first try. With @BrodieG's string method or this answer, you'll have to resample again and again in the hope of drawing a full `n`. With large `max_len`, that shouldn't be such a problem, I guess.
This answer scales better than `bro_string` because it doesn't require the construction of `pool` up front.