 

Sample without replacement

Tags:

tensorflow

How can I sample without replacement in TensorFlow? Like numpy.random.choice(n, size=k, replace=False), for some very large integer n (e.g. 100k-100M) and a smaller k (e.g. 100-10k). Also, I want it to be efficient and to run on the GPU, so solutions based on tf.py_func are not really an option for me. Anything that would use tf.range(n) or similar is also not an option, because n could be very large.
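
For reference, this is the NumPy behaviour I mean (a small illustrative snippet; the sizes here are placeholders, not the full range):

import numpy as np

n, k = 1000000, 1000  # illustrative sizes only
idx = np.random.choice(n, size=k, replace=False)  # k distinct indices in [0, n)
print(idx.shape, len(set(idx)))  # (1000,) 1000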

asked Mar 29 '18 by Albert



2 Answers

This is one way:

n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]
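
For instance, with concrete numbers (a minimal TF 1.x-style sketch; note that this materialises a tensor of size n, which your question rules out for very large n):

import tensorflow as tf

n = 1000          # population size
sample_size = 10  # number of distinct indices to draw

idx = tf.random_shuffle(tf.range(n))[:sample_size]

with tf.Session() as sess:
    print(sess.run(idx))  # e.g. [421  17 978 ...], all distinct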

EDIT:

I had posted the answer above but then read the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor of size O(n) (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could resort to a tf.while_loop until you have unique indices:

n = ...
sample_size = ...
# Draw once, then keep re-drawing while the sample contains duplicates
idx = tf.random_uniform([sample_size], maxval=n, dtype=tf.int64)
idx = tf.while_loop(
    lambda idx: tf.size(tf.unique(idx)[0]) < sample_size,
    lambda idx: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
    [idx])

EDIT 2:

About the average number of iterations in the previous method: if we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration succeeds (all k draws are distinct) is:

p = product((n - (i - 1)) / n for i in 1 .. k)

Since each iteration can be considered a Bernoulli trial, the average number of trials until the first success is 1 / p (a standard property of the geometric distribution). For example, with n = 10 and k = 5, p = 0.3024, so about 3.3 trials on average. Here is a function that computes the average number of trials in Python for some k and n values:

def avg_iter(k, n):
    # Average number of redraws until a size-k sample from [0, n) has no duplicates:
    # 1 / p = product of n / (n - i) for i in 0 .. k - 1
    if k > n or n <= 0 or k < 0:
        raise ValueError()
    avg_it = 1.0
    for p in (float(n) / (n - i) for i in range(k)):
        avg_it *= p
    return avg_it

And here are some results:

+-------+------+----------+
|   n   |  k   | Avg iter |
+-------+------+----------+
|    10 |    5 | 3.3      |
|   100 |   10 | 1.6      |
|  1000 |   10 | 1.1      |
|  1000 |  100 | 167.8    |
| 10000 |   10 | 1.0      |
| 10000 |  100 | 1.6      |
| 10000 | 1000 | 2.9e+22  |
+-------+------+----------+

You can see it varies wildly depending on the parameters.
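
For reference, the table can be reproduced with a small loop over avg_iter (a sketch using the function above):

for n, k in [(10, 5), (100, 10), (1000, 10), (1000, 100),
             (10000, 10), (10000, 100), (10000, 1000)]:
    print('n={:>5}  k={:>4}  avg iter={:.3g}'.format(n, k, avg_iter(k, n)))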

It is possible, though, to construct a vector in a fixed number of steps, although the only algorithm I can think of is O(k²). In pure Python it goes like this:

import random

def sample_wo_replacement(n, k):
    # First pass: draw the i-th value uniformly from a range that shrinks by one each step
    sample = [0] * k
    for i in range(k):
        sample[i] = random.randint(0, n - 1 - i)
    # Second pass: shift each draw up past the earlier (raw) draws so all final values are distinct
    for i, v in reversed(list(enumerate(sample))):
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample

random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]

EDIT 3:

This is a possible way to do it in TensorFlow (not sure if it is the best one):

import tensorflow as tf

def sample_wo_replacement_tf(n, k):
    # First loop
    sample = tf.constant([], dtype=tf.int64)
    i = 0
    sample, _ = tf.while_loop(
        lambda sample, i: i < k,
        # This is ugly but I did not want to define more functions
        lambda sample, i: (tf.concat([sample,
                                      tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                     axis=0),
                           i + 1),
        [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
    # Second loop
    def inner_loop(sample, i):
        sample_size = tf.shape(sample)[0]
        v = sample[i]
        j = i - 1
        v, _ = tf.while_loop(
            lambda v, j: j >= 0,
            lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
            [v, j])
        return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
    i = tf.shape(sample)[0] - 1
    sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
    return sample

And an example:

with tf.Graph().as_default(), tf.Session() as sess:
    tf.set_random_seed(100)
    sample = sample_wo_replacement_tf(10, 5)
    for i in range(10):
        print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]

This is quite intensive on tf.while_loops, though, which are well known not to be particularly fast in TensorFlow, so I don't know how fast you can really get with this method without some kind of benchmarking.
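
A very rough way to benchmark it is to time repeated session runs (a sketch assuming the sample_wo_replacement_tf function above; the numbers will depend heavily on hardware and on n and k):

import time
import tensorflow as tf

with tf.Graph().as_default(), tf.Session() as sess:
    sample = sample_wo_replacement_tf(10000, 100)
    sess.run(sample)  # warm-up run (the first run can include one-off setup costs)
    start = time.time()
    for _ in range(10):
        sess.run(sample)
    print('avg seconds per sample: {:.3f}'.format((time.time() - start) / 10))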


EDIT 4:

One last possible method. You can divide the range of possible values (0 to n) into "chunks" of size c and pick a random number of values from each chunk, then shuffle everything. The amount of memory you use is limited by c, and you don't need nested loops. If n is divisible by c, you should get close to a perfect random distribution; otherwise values in the last "short" chunk would receive some extra probability (this may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long to account for different corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts get simplified.

import numpy as np

def sample_chunked(n, k, chunk=None):
    chunk = chunk or n
    last_chunk = chunk
    parts = n // chunk
    # Distribute k among chunks
    max_p = min(float(chunk) / k, 1.0)
    max_p_last = max_p
    if n % chunk != 0:
        parts += 1
        last_chunk = n % chunk
        max_p_last = min(float(last_chunk) / k, 1.0)
    p = np.full(parts, 2)
    # Iterate until a valid distribution is found
    while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
        p = np.random.uniform(size=parts)
        p /= np.sum(p)
    dist = (k * p).astype(np.int64)
    sample_size = np.sum(dist)
    # Account for rounding errors
    while sample_size < k:
        i = np.random.randint(len(dist))
        while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
            i = np.random.randint(len(dist))
        dist[i] += 1
        sample_size += 1
    while sample_size > k:
        i = np.random.randint(len(dist))
        while dist[i] == 0:
            i = np.random.randint(len(dist))
        dist[i] -= 1
        sample_size -= 1
    assert sample_size == k
    # Generate sample parts
    sample_parts = []
    for i, v in enumerate(np.nditer(dist)):
        if v <= 0:
            continue
        c = chunk if i < parts - 1 else last_chunk
        base = chunk * i
        sample_parts.append(base + np.random.choice(c, v, replace=False))
    sample = np.concatenate(sample_parts, axis=0)
    np.random.shuffle(sample)
    return sample

np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8  9 12 13  3]

A quick benchmark of sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds on my computer, while I haven't been able to run the previous algorithm (the sample_wo_replacement function above) to completion with the same parameters. It should be possible to implement it in TensorFlow, maybe using tf.TensorArray, although it would require significant effort to get it exactly right.

answered by jdehesa

Use the Gumbel-max trick described here: https://github.com/tensorflow/tensorflow/issues/9260

z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits), 0, 1)))
_, indices = tf.nn.top_k(logits + z, K)

indices are what you want. This trick is so easy!
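
For example, drawing K distinct indices uniformly at random could look like this (a minimal TF 1.x-style sketch; note that for uniform sampling the logits tensor still has size n, so this does not avoid the O(n) memory discussed in the question):

import tensorflow as tf

n, K = 1000, 10
logits = tf.zeros([n])  # uniform weights; use real log-probabilities for weighted sampling

# Gumbel-max trick: perturb the logits with Gumbel noise and take the top K
z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits), 0, 1)))
_, indices = tf.nn.top_k(logits + z, K)

with tf.Session() as sess:
    print(sess.run(indices))  # K distinct indices in [0, n)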

answered by Xiang Qi