Random sample of character vector, without elements prefixing one another

Question

Consider a character vector, pool, whose elements are (zero-padded) binary numbers with up to max_len digits.

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))

pool
##  [1] "0"    "1"    "00"   "10"   "01"   "11"   "000"  "100"  "010"  "110" 
## [11] "001"  "101"  "011"  "111"  "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"

I'd like to sample n of these elements, with the constraint that none of the sampled elements are prefixes of any of the other sampled elements (i.e. if we sample 1101, we ban 1, 11, and 110, while if we sample 1, we ban those elements beginning with 1, such as 10, 11, 100, etc.).

The following is my attempt using while, but of course this is slow when n is large (or when it approaches 2^max_len).

set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
  prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
  pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
    paste0('^', chosen[!prefixes]), pool)) == 0]
  chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}

chosen
## [1] "0100" "0101" "0001" "0011" "1000" "111"  "0000" "0110" "1100" "0111"

This can be improved slightly by initially removing from pool those elements whose inclusion would mean there are insufficient elements left in pool to take a total sample of size n. E.g., when max_len = 4 and n > 9, we can immediately remove 0 and 1 from pool, since by including either, the maximum sample would be 9 (either 0 and the eight 4-character elements beginning with 1, or 1 and the eight 4-character elements beginning with 0).

Based on this logic, we could then omit elements from pool, prior to taking the initial sample, with something like:

pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]

Can anyone think of a better approach? I feel like I'm overlooking something much more simple.

EDIT

To clarify my intentions, I'll portray the pool as a set of branches, where the junctions and tips are the nodes (elements of pool). Assume the yellow node in the following figure (i.e. 010) was drawn. Now, the entire red "branch", which consists of nodes 0, 01, and 010, is removed from the pool. This is what I meant by forbidding sampling of nodes that "prefix" nodes already in our sample (as well as those already prefixed by nodes in our sample).

enter image description here

If the sampled node was half-way along a branch, such as 01 in the following figure, then all red nodes (0, 01, 010, and 011) are disallowed, since 0 prefixes 01, and 01 prefixes both 010 and 011.

enter image description here

I don't mean to sample either 1 or 0 at each junction (i.e. walking along the branches flipping coins at forks) - it's fine to have both in the sample, as long as: (1) parents (or grand-parents, etc.) or children (grandchildren, etc.) of the node aren't already sampled; and (2) upon sampling the node there will be sufficient nodes remaining to achieve the desired sample of size n.

In the second figure above, if 010 was the first pick, all nodes at black nodes are still (currently) valid, assuming n <= 4. For example, if n==4 and we sampled node 1 next (and so our picks now included 01 and 1), we would subsequently disallow node 00 (due to rule 2 above) but could still pick 000 and 001, giving us our 4-element sample. If n==5, on the other hand, node 1 would be disallowed at this stage.

BrodieG · Accepted Answer

You can sort the pool to help decide what elements to disqualify. For example, looking at a three element sorted pool:

 [1] "0"   "00"  "000" "001" "01"  "010" "011" "1"   "10"  "100" "101" "11" 
[13] "110" "111"

I can tell that I can disqualify anything that follows my selected item that has more characters than my item up to first item that has the same number or fewer characters. For example, if I select "01", I can immediately see that the next two items ("010", "011") need to be removed, but not the one after that because "1" has fewer characters. Removing the "0" afterwards is easy. Here is an implementation:

library(fastmatch)  # could use `match`, but we repeatedly search against same hash

# `pool` must be sorted!

sample01 <- function(pool, n) {
  picked <- logical(length(pool))
  chrs <- nchar(pool)
  pick.list <- character(n)
  pool.seq <- seq_along(pool)

  for(i in seq(n)) {
    # Make sure pool not exhausted

    left <- which(!picked)
    left.len <- length(left)
    if(!length(left)) break

    # Sample from pool

    seq.left <- seq.int(left)
    pool.left <- pool[left]
    chrs.left <- chrs[left]
    pick <- sample(length(pool.left), 1L)

    # Find all the elements with more characters that are disqualified
    # and store their indices in `valid` (bad name...)

    valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick
    first.invalid <- which(!valid.tmp & seq.left > pick)
    valid <- if(length(first.invalid)) {
      pick:(first.invalid[[1L]] - 1L)
    } else pick:left.len

    # Translate back to original pool indices since we're working on a 
    # subset in `pool.left`

    pool.seq.left <- pool.seq[left]
    pool.idx <- pool.seq.left[valid]
    val <- pool[[pool.idx[[1L]]]]

    # Record the picked value, and all the disqualifications

    pick.list[[i]] <- val
    picked[pool.idx] <- TRUE

    # Disqualify shorter matches

    to.rem <- vapply(
      seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L
    )
    to.rem.idx <- fmatch(to.rem, pool, nomatch=0)
    picked[to.rem.idx] <- TRUE  
  }
  pick.list  
}

And a function to make sorted pools (exactly like your code, but returns sorted):

make_pool <- function(size)
  sort(
    unlist(
      lapply(
        seq_len(size), 
        function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))) 
  ) ) )

Then, using the max_len 3 pool (useful for visually inspecting things behave as expected):

pool3 <- make_pool(3)
set.seed(1)
sample01(pool3, 8)
# [1] "001" "1"   "010" "011" "000" ""    ""    ""   
sample01(pool3, 8)
# [1] "110" "111" "011" "10"  "00"  ""    ""    ""   
sample01(pool3, 8)
# [1] "000" "01"  "11"  "10"  "001" ""    ""    ""   
sample01(pool3, 8)
# [1] "011" "101" "111" "001" "110" "100" "000" "010"

Notice in the last case we get all 3 digit binary combinations (2 ^ 3) because by chance we kept sampling from the 3 digit ones. Also, with just a 3 size pool there are many samplings that prevent a full 8 draws; you could address this with your suggestion of eliminating the combinations that prevent full draws from the pool.

This is pretty fast. Looking at the max_len==9 example that took 2 seconds with the alternate solution:

pool9 <- make_pool(9)
microbenchmark(sample01(pool9, 4))
# Unit: microseconds
#                expr     min      lq  median      uq     max neval
#  sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663   100

About half a millisecond. You can reasonably attempt fairly large pools too:

pool16 <- make_pool(16)  # 131K entries
system.time(sample01(pool16, 100))
#  user  system elapsed 
# 3.407   0.146   3.552

This is not very fast, but we're talking about a pool with 130K items. There is also potential for additional optimization.

Note the sorting step gets to be relatively slow for large pools, but I'm not counting it because you only ever need to do it once, and you could probably come up with a reasonable algorithm to produce the pool pre sorted.

There is also the possibility of a much faster integer to binary based approach that I explored in a now deleted answer, but that requires a fair bit more work to tie out to exactly what you're looking for.

Tensibai · Answer

I found the problem interesting so I tried and get this with really low skills in R (so this may be improved):

Newer edited version, thanks to @Franck advice:

library(microbenchmark)
library(lineprof)

max_len <- 16
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
n<-100

library(stringr)
tree_sample <- function(samples,pool) {
  results <- vector("integer",samples)
  # Will be used on a regular basis, compute it in advance
  PoolLen <- str_length(pool)
  # Make a mask vector based on the length of each entry of the pool
  masks <- strtoi(str_pad(str_pad("1",PoolLen,"right","1"),max_len,"right","0"),base=2)

  # Make an integer vector from "0" right padded orignal: for max_len=4 and pool entry "1" we get "1000" => 8
  # This will allow to find this entry as parent of 10 and 11 which become "1000" and "1100", as integer 8 and 12 respectively
  # once bitwise "anded" with the repective mask "1000" the first bit is striclty the same, so it's a parent.
  integerPool <- strtoi(str_pad(pool,max_len,"right","0"),base=2)

  # Create a vector to filter the available value to sample
  ok <- rep(TRUE,length(pool))

  #Precompute the result of the bitwise and betwwen our integer pool and the masks   
  MaskedPool <- bitwAnd(integerPool,masks)

  while(samples) {
    samp <- sample(pool[ok],1) # Get a sample
    results[samples] <- samp # Store it as result
    ok[pool == samp] <- FALSE # Remove it from available entries

    vsamp <- strtoi(str_pad(samp,max_len,"right","0"),base=2) # Get the integer value of the "0" right padded sample
    mlen <- str_length(samp) # Get sample len

    #Creation of unitary mask to remove childs of sample
    mask <- strtoi(paste0(rep(1:0,c(mlen,max_len-mlen)),collapse=""),base=2)

    # Get the result of bitwise And between the integerPool and the sample mask 
    FilterVec <- bitwAnd(integerPool,mask)

    # Get the bitwise and result of the sample and it's mask
    Childm <- bitwAnd(vsamp,mask)

    ok[FilterVec == Childm] <- FALSE  # Remove from available entries the childs of the sample
    ok[MaskedPool == bitwAnd(vsamp,masks)] <- FALSE # compare the sample with all the masks to remove parents matching

    samples <- samples -1
  }
  print(results)
}
microbenchmark(tree_sample(n,pool),times=10L)

The main idea is to use the bitmask comparison to know if one sample is parent of the other (common bit part), if yes suppress this element from the pool.

It takes now 1,4 s to draw 100 samples from a pool of length 16 on my machine.

swenzel · Answer

It is in python and not in r but jbaums said it would be okay.

So here is my contribution, see comments in the source for explanation of the crucial parts.
I'm still working on the analytical solution to determine the amount of possible combinations c for a tree of depth t and S samples, so I can improve the function combs. Maybe someone has it? This is really the bottleneck now.

Sampling 100 nodes from a tree with depth 16 takes around 8ms on my laptop. Not the first time, but it gets faster up to a certain point the more you sample because combBuffer gets filled.

import random


class Tree(object):
    """
    :param level: The distance of this node from the root.
    :type level: int
    :param parent: This trees parent node
    :type parent: Tree
    :param isleft: Determines if this is a left or a right child node. Can be
                   omitted if this is the root node.
    :type isleft: bool

    A binary tree representing possible strings which match r'[01]{1,n}'. Its
    purpose is to be able to sample n of its nodes where none of the sampled
    nodes' ids is a prefix for another one.
    It is possible to change Tree.maxdepth and then reuse the root. All
    children are created ON DEMAND, which means everything is lazily evaluated.
    If the Tree gets too big anyway, you can call 'prune' on any node to delete
    its children.

        >>> t = Tree()
        >>> t.sample(8, toString=True, depth=3)
        ['111', '110', '101', '100', '011', '010', '001', '000']
        >>> Tree.maxdepth = 2
        >>> t.sample(4, toString=True)
        ['11', '10', '01', '00']
    """

    maxdepth = 10
    _combBuffer = {}

    def __init__(self, level=0, parent=None, isleft=None):
        self.parent = parent
        self.level = level
        self.isleft = isleft
        self._left = None
        self._right = None

    @classmethod
    def setMaxdepth(cls, depth):
        """
        :param depth: The new depth
        :type depth: int

        Sets the maxdepth of the Trees. This basically is the depth of the root
        node.
        """
        if cls.maxdepth == depth:
            return

        cls.maxdepth = depth

    @property
    def left(self):
        """This tree's left child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._left is None:
            self._left = Tree(self.level+1, self, True)
        return self._left

    @property
    def right(self):
        """This tree's right child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._right is None:
            self._right = Tree(self.level+1, self, False)
        return self._right

    @property
    def depth(self):
        """
        This tree's depth. (maxdepth-level)
        """
        return self.maxdepth-self.level

    @property
    def id(self):
        """
        This tree's id, string of '0's and '1's equal to the path from the root
        to this subtree. Where '1' means going left and '0' means going right.
        """
        # level 0 is the root node, it has no id
        if self.level == 0:
            return ''
        # This takes at most Tree.maxdepth recursions. Therefore
        # it is save to do it this way. We could also save each nodes
        # id once it is created to avoid recreating it every time, however
        # this won't save much time but use quite some space.
        return self.parent.id + ('1' if self.isleft else '0')

    @property
    def leaves(self):
        """
        The amount of leave nodes, this tree has. (2**depth)
        """
        return 2**self.depth

    def __str__(self):
        return self.id

    def __len__(self):
        return 2*self.leaves-1

    def prune(self):
        """
        Recursively prune this tree's children.
        """
        if self._left is not None:
            self._left.prune()
            self._left.parent = None
            self._left = None

        if self._right is not None:
            self._right.prune()
            self._right.parent = None
            self._right = None

    def combs(self, n):
        """
        :param n: The amount of samples to be taken from this tree
        :type n: int
        :returns: The amount of possible combinations to choose n samples from
                  this tree

        Determines recursively the amount of combinations of n nodes to be
        sampled from this tree.
        Subsequent calls with same n on trees with same depth will return the
        result from the previous computation rather than computing it again.

            >>> t = Tree()
            >>> Tree.maxdepth = 4
            >>> t.combs(16)
            1
            >>> Tree.maxdepth = 3
            >>> t.combs(6)
            58
        """

        # important for the amount of combinations is only n and the depth of
        # this tree
        key = (self.depth, n)

        # We use the dict to save computation time. Calling the function with
        # equal values on equal nodes just returns the alrady computed value if
        # possible.
        if key not in Tree._combBuffer:
            leaves = self.leaves

            if n < 0:
                N = 0
            elif n == 0 or self.depth == 0 or n == leaves:
                N = 1
            elif n == 1:
                return (2*leaves-1)
            else:
                if n > leaves/2:
                    # if n > leaves/2, at least n-leaves/2 have to stay on
                    # either side, otherweise the other one would have to
                    # sample more nodes than possible.
                    nMin = n-leaves/2
                else:
                    nMin = 0

                # The rest n-2*nMin is the amount of samples that are free to
                # fall on either side
                free = n-2*nMin

                N = 0
                # sum up the combinations of all possible splits
                for addLeft in range(0, free+1):
                    nLeft = nMin + addLeft
                    nRight = n - nLeft
                    N += self.left.combs(nLeft)*self.right.combs(nRight)

            Tree._combBuffer[key] = N
            return N
        return Tree._combBuffer[key]

    def sample(self, n, toString=False, depth=None):
        """
        :param n: How may samples to take from this tree
        :type n: int
        :param toString: If 'True' result will direclty be turned into a list
                         of strings
        :type toString: bool
        :param depth: If not None, will overwrite Tree.maxdepth
        :type depth: int or None
        :returns: List of n nodes sampled from this tree
        :throws ValueError: when n is invalid

        Takes n random samples from this tree where none of the sample's ids is
        a prefix for another one's.

        For an example see Tree's docstring.
        """
        if depth is not None:
            Tree.setMaxdepth(depth)

        if toString:
            return [str(e) for e in self.sample(n)]

        if n < 0:
            raise ValueError('Negative sample size is not possible!')

        if n == 0:
            return []

        leaves = self.leaves
        if n > leaves:
            raise ValueError(('Cannot sample {} nodes, with only {} ' +
                              'leaves!').format(n, leaves))

        # Only one sample to choose, that is nice! We are free to take any node
        # from this tree, including this very node itself.
        if n == 1 and self.level > 0:
            # This tree has 2*leaves-1 nodes, therefore
            # the probability that we keep the root node has to be
            # 1/(2*leaves-1) = P_root. Lets create a random number from the
            # interval [0, 2*leaves-1).
            # It will be 0 with probability 1/(2*leaves-1)
            P_root = random.randint(0, len(self)-1)
            if P_root == 0:
                return [self]
            else:
                # The probability to land here is 1-P_root

                # A child tree's size is (leaves-1) and since it obeys the same
                # rule as above, the probability for each of its nodes to
                # 'survive' is 1/(leaves-1) = P_child.
                # However all nodes must have equal probability, therefore to
                # make sure that their probability is also P_root we multiply
                # them by 1/2*(1-P_root). The latter is already done, the
                # former will be achieved by the next condition.
                # If we do everything right, this should hold:
                # 1/2 * (1-P_root) * P_child = P_root

                # Lets see...
                # 1/2 * (1-1/(2*leaves-1)) * (1/leaves-1)
                # (1-1/(2*leaves-1)) * (1/(2*(leaves-1)))
                # (1-1/(2*leaves-1)) * (1/(2*leaves-2))
                # (1/(2*leaves-2)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-1)/((2*leaves-2) * (2*leaves-1)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-2)/((2*leaves-2) * (2*leaves-1))
                # 1/(2*leaves-1)
                # There we go!
                if random.random() < 0.5:
                    return self.right.sample(1)
                else:
                    return self.left.sample(1)

        # Now comes the tricky part... n > 1 therefore we are NOT going to
        # sample this node. Its probability to be chosen is 0!
        # It HAS to be 0 since we are definitely sampling from one of its
        # children which means that this node will be blocked by those samples.
        # The difficult part now is to prove that the sampling the way we do it
        # is really random.

        if n > leaves/2:
            # if n > leaves/2, at least n-leaves/2 have to stay on either
            # side, otherweise the other one would have to sample more
            # nodes than possible.
            nMin = n-leaves/2
        else:
            nMin = 0
        # The rest n-2*nMin is the amount of samples that are free to fall
        # on either side
        free = n-2*nMin

        # Let's have a look at an example, suppose we were to distribute 5
        # samples among two children which have 4 leaves each.
        # Each child has to get at least 1 sample, so the free samples are 3.
        # There are 4 different ways to split the samples among the
        # children (left, right):
        # (1, 4), (2, 3), (3, 2), (4, 1)
        # The amount of unique sample combinations per child are
        # (7, 1), (11, 6), (6, 11), (1, 7)
        # The amount of total unique samples per possible split are
        #   7   ,   66  ,   66  ,    7
        # In case of the first and last split, all samples have a probability
        # of 1/7, this was already proven above.
        # Lets suppose we are good to go and the per sample probabilities for
        # the other two cases are (1/11, 1/6) and (1/6, 1/11), this way the
        # overall per sample probabilities for the splits would be:
        #  1/7  ,  1/66 , 1/66 , 1/7
        # If we used uniform random to determine the split, all splits would be
        # equally probable and therefore be multiplied with the same value (1/4)
        # But this would mean that NOT every sample is equally probable!
        # We need to know in advance how many sample combinations there will be
        # for a given split in order to find out the probability to choose it.
        # In fact, due to the restrictions, this becomes very nasty to
        # determine. So instead of solving it analytically, I do it numerically
        # with the method 'combs'. It gives me the amount of possible sample
        # combinations for a certain amount of samples and a given tree depth.
        # It will return 146 for this node and 7 for the outer and 66 for the
        # inner splits.
        # What we now do is, we take a number from [0, 146).
        # if it is smaller than 7, we sample from the first split,
        # if it is smaller than 7+66, we sample from the second split,
        # ...
        # This way we get the probabilities we need.

        r = random.randint(0, self.combs(n)-1)
        p = 0
        for addLeft in xrange(0, free+1):
            nLeft = nMin + addLeft
            nRight = n - nLeft

            p += (self.left.combs(nLeft) * self.right.combs(nRight))
            if r < p:
                return self.left.sample(nLeft) + self.right.sample(nRight)
        assert False, ('Something really strange happend, p did not sum up ' +
                       'to combs or r was too big')


def main():
    """
    Do a microbenchmark.
    """
    import timeit
    i = 1
    main.t = Tree()
    template = ' {:>2}  {:>5} {:>4}  {:<5}'
    print(template.format('i', 'depth', 'n', 'time (ms)'))
    N = 100
    for depth in [4, 8, 15, 16, 17, 18]:
        for n in [10, 50, 100, 150]:
            if n > 2**depth:
                time = '--'
            else:
                time = timeit.timeit(
                    'main.t.sample({}, depth={})'.format(n, depth), setup=
                    'from __main__ import main', number=N)*1000./N
            print(template.format(i, depth, n, time))
            i += 1


if __name__ == "__main__":
    main()

The benchmark output:

  i  depth    n  time (ms)
  1      4   10  0.182511806488
  2      4   50  --   
  3      4  100  --   
  4      4  150  --   
  5      8   10  0.397620201111
  6      8   50  1.66054964066
  7      8  100  2.90236949921
  8      8  150  3.48146915436
  9     15   10  0.804011821747
 10     15   50  3.7428188324
 11     15  100  7.34910964966
 12     15  150  10.8230614662
 13     16   10  0.804491043091
 14     16   50  3.66818904877
 15     16  100  7.09567070007
 16     16  150  10.404779911
 17     17   10  0.865840911865
 18     17   50  3.9999294281
 19     17  100  7.70257949829
 20     17  150  11.3758206367
 21     18   10  0.915451049805
 22     18   50  4.22935962677
 23     18  100  8.22361946106
 24     18  150  12.2081303596

10 samples of size 10 from a tree with depth 10:

['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010']
['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111']
['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101']
['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110']
['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110']
['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101']
['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111']
['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111']
['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100']
['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000']

Frank · Answer

Mapping ids to strings. You can map numbers to your 0/1 vectors, as @BrodieG mentioned:

# some key objects

n_pool      = sum(2^(1:max_len))      # total number of indices
cuts        = cumsum(2^(1:max_len-1)) # new group starts
inds_by_g   = mapply(seq,cuts,cuts*2) # indices grouped by length

# the mapping to strings (one among many possibilities)

library(data.table)
get_01str <- function(id,max_len){
    cuts = cumsum(2^(1:max_len-1))
    g    = findInterval(id,cuts)
    gid  = id-cuts[g]+1

    data.table(g,gid)[,s:=
      do.call(paste,c(list(sep=""),lapply(
        seq(g[1]), 
        function(x) (gid-1) %/% 2^(x-1) %% 2
      )))
    ,by=g]$s      
}

Finding ids to drop. We'll sequentially drop ids from the sampling pool:

 # the mapping from one index to indices of nixed strings

get_nixstrs <- function(g,gid,max_len){

    cuts         = cumsum(2^(1:max_len-1))
    gids_child   = {
      x = gid%%2^sequence(g-1)
      ifelse(x,x,2^sequence(g-1))
    }
    ids_child    = gids_child+cuts[sequence(g-1)]-1

    ids_parent   = if (g==max_len) gid+cuts[g]-1 else {

      gids_par       = vector(mode="list",max_len)
      gids_par[[g]]  = gid
      for (gg in seq(g,max_len-1)) 
        gids_par[[gg+1]] = c(gids_par[[gg]],gids_par[[gg]]+2^gg)

      unlist(mapply(`+`,gids_par,cuts-1))
    }

    c(ids_child,ids_parent)
}

The indices are grouped by g, the number of characters, nchar(get_01str(id)). Because the indices are sorted by g, g=findInterval(id,cuts) is a faster route.

An index in group g, 1 < g < max_len has one "child" index of size g-1 and two parent indices of size g+1. For each child node, we take its child node until we hit g==1; and for each parent node, we take their pair of parent nodes until we hit g==max_len.

The structure of the tree is simplest in terms of the identifier within the group, gid. gid maps to the two parents, gid and gid+2^g; and reversing this mapping one finds the child.

Sampling

drawem <- function(n,max_len){
    cuts        = cumsum(2^(1:max_len-1))
    inds_by_g   = mapply(seq,cuts,cuts*2)

    oklens = (1:max_len)[ n <= 2^max_len*(1-2^(-(1:max_len)))+1 ]
    okinds = unlist(inds_by_g[oklens])

    mysamp = rep(0,n)
    for (i in 1:n){

        id        = if (length(okinds)==1) okinds else sample(okinds,1)
        g         = findInterval(id,cuts)
        gid       = id-cuts[g]+1
        nixed     = get_nixstrs(g,gid,max_len)

        # print(id); print(okinds); print(nixed)

        mysamp[i] = id
        okinds    = setdiff(okinds,nixed)
        if (!length(okinds)) break
    }

    res <- rep("",n)
    res[seq.int(i)] <- get_01str(mysamp[seq.int(i)],max_len)
    res
}

The oklens part integrates the OP's idea for omitting strings that are guaranteed to make the sampling impossible. However, even doing this, we may follow a sampling path that leaves us with no more options. Taking the OP's example of max_len=4 and n=10, we know we must drop 0 and 1 from consideration, but what happens if our first four draws are 00, 01, 11 and 10? Oh well, I guess we are out of luck. This is why you should actually define sampling probabilities. (The OP has another idea, for determining, at each step, which nodes will lead to an impossible state, but that seems a tall order.)

Illustration

# how the indices line up

n_pool = sum(2^(1:max_len)) 
pdt <- data.table(id=1:n_pool)
pdt[,g:=findInterval(id,cuts)]
pdt[,gid:=1:.N,by=g]
pdt[,s:=get_01str(id,max_len)]

# example run

set.seed(4); drawem(5,5)
# [1] "01100" "1"     "0001"  "0101"  "00101"

set.seed(4); drawem(8,4)
# [1] "1100" "0"    "111"  "101"  "1101" "100"  ""     ""

Benchmarks (older than those in @BrodieG's answer)

require(rbenchmark)
max_len = 8
n = 8

benchmark(
      jos_lp     = {
        pool <- unlist(lapply(seq_len(max_len),
          function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
        sample.lp(pool, n)},
      bro_string = {pool <- make_pool(max_len);sample01(pool,n)},
      fra_num    = drawem(n,max_len),
      replications=5)[1:5]
#         test replications elapsed relative user.self
# 2 bro_string            5    0.05      2.5      0.05
# 3    fra_num            5    0.02      1.0      0.02
# 1     jos_lp            5    1.56     78.0      1.55

n = 12
max_len = 12
benchmark(
  bro_string={pool <- make_pool(max_len);sample01(pool,n)},
  fra_num=drawem(n,max_len),
  replications=5)[1:5]
#         test replications elapsed relative user.self
# 1 bro_string            5    0.54     6.75      0.51
# 2    fra_num            5    0.08     1.00      0.08

Other answers. There are two other answers:

jos_enum = {pool <- unlist(lapply(seq_len(max_len), 
    function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
  get.template(pool, n)}
bro_num  = sample011(max_len,n)

I left out @josilber's enumeration method because it took far too long; and @BrodieG's numeric/index method because it didn't work at the time, but now does. See @BrodieG's updated answer for more benchmarking.

Speed vs correctness. While @josilber's answers are far slower (and for the enumeration method, apparently much more memory-intensive), they are guaranteed to draw a sample of size n on the first try. With @BrodieG's string method or this answer, you'll have to resample again and again in the hope of drawing a full n. With large max_len, that shouldn't be such a problem, I guess.

This answer scales better than bro_string because it doesn't require the construction of pool up front.

Random sample of character vector, without elements prefixing one another

Tags:

performance

r

combinatorics

jbaums

4 Answers

BrodieG

Tensibai

swenzel

Frank

Recent Activity

Donate For Us

Random sample of character vector, without elements prefixing one another

Tags:

performance

r

combinatorics

jbaums

4 Answers

BrodieG

Tensibai

swenzel

Frank

Related questions

Recent Activity

Donate For Us