Sample without replacement

Tags:

tensorflow

How can I sample without replacement in TensorFlow? I want the equivalent of numpy.random.choice(n, size=k, replace=False) for a very large integer n (e.g. 100k-100M) and a smaller k (e.g. 100-10k). It also has to be efficient and run on the GPU, so solutions based on tf.py_func are not really an option for me. Anything that uses tf.range(n) or similar is not an option either, because n could be very large.

asked Mar 29 '18 14:03

Albert

2 Answers

This is one way:

n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]

EDIT:

I had posted the answer above before reading the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor of size O(n) (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could resort to a tf.while_loop that redraws until all indices are unique:

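n = ...
sample_size = ...
# Start from a vector of zeros, which contains duplicates for sample_size > 1
idx = tf.zeros(sample_size, dtype=tf.int64)
idx = tf.while_loop(
    # Keep looping while the vector still contains duplicates
    lambda idx: tf.size(tf.unique(idx)[0]) < sample_size,
    # Redraw the whole vector
    lambda idx: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
    [idx])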
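
For reference, the same redraw-until-unique idea can be sketched in plain Python. This is only an illustration of the loop logic, not a GPU implementation; the function name sample_by_redraw and the rng parameter are made up for this example.

```python
import random

def sample_by_redraw(n, k, rng=random):
    """Redraw k values from range(n) until all of them are distinct.

    Returns the sample and the number of attempts it took.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    attempts = 0
    while True:
        attempts += 1
        # Draw all k values independently (with replacement)
        idx = [rng.randrange(n) for _ in range(k)]
        # Accept only if no value is repeated
        if len(set(idx)) == k:
            return idx, attempts

idx, attempts = sample_by_redraw(1000, 10, random.Random(0))
print(idx, attempts)
```

The number of attempts is random, which is why the expected number of iterations matters for this approach.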

EDIT 2:

About the average number of iterations in the previous method. If we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration is successful is:

p = product((n - (i - 1)) / n for i in 1 .. k)

Since each iteration can be considered a Bernoulli trial, the average number of trials until the first success is 1 / p (the mean of a geometric distribution). Here is a function that calculates the average number of trials in Python for some k and n values:

def avg_iter(k, n):
    if k > n or n <= 0 or k < 0:
        raise ValueError()
    avg_it = 1.0
    for p in (float(n) / (n - i) for i in range(k)):
        avg_it *= p
    return avg_it
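
For larger k the running product can grow past what a float can hold, so it may be more convenient to work with the logarithm of 1 / p. The sketch below does that; log_avg_iter is a made-up name, and the comparison value k * (k - 1) / (2 * n) is the usual first-order approximation of -log(p), which should only be trusted when k is much smaller than n.

```python
import math

def log_avg_iter(k, n):
    """Natural log of the expected number of iterations, i.e. log(1 / p)."""
    if not 0 <= k <= n or n <= 0:
        raise ValueError("need 0 <= k <= n and n > 0")
    # -log(p) = -sum(log(1 - i / n)) for i in 0 .. k - 1
    return -sum(math.log1p(-i / n) for i in range(k))

for n, k in [(10**4, 10**3), (10**8, 10**4)]:
    exact = log_avg_iter(k, n)
    approx = k * (k - 1) / (2 * n)  # first-order approximation
    print(n, k, exact, approx)
```

Since the expected number of iterations grows roughly like exp(k² / (2n)), the redraw loop is only practical while k² stays within a small multiple of n.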

And here are some results:

+-------+------+----------+
|   n   |  k   | Avg iter |
+-------+------+----------+
|    10 |    5 | 3.3      |
|   100 |   10 | 1.6      |
|  1000 |   10 | 1.0      |
|  1000 |  100 | 167.8    |
| 10000 |   10 | 1.0      |
| 10000 |  100 | 1.6      |
| 10000 | 1000 | 2.9e+22  |
+-------+------+----------+

You can see it varies wildly depending on the parameters.

It is possible, though, to construct a vector in a fixed number of steps, although the only algorithm I can think of is O(k²). In pure Python it goes like this:

import random

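def sample_wo_replacement(n, k):
    sample = [0] * k
    # Draw the i-th value from a range that shrinks by one at each step
    for i in range(k):
        sample[i] = random.randint(0, n - 1 - i)
    # Going backwards, shift each value past the earlier draws so that
    # all values end up distinct
    for i, v in reversed(list(enumerate(sample))):
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample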

random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]
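
One way to check that the second loop really turns the raw draws into distinct values is to enumerate every possible draw sequence for a small n and k, assuming the i-th raw draw is taken from range(n - i). The helper names shift_draws and all_samples are made up for this sketch.

```python
from itertools import permutations, product

def shift_draws(draws):
    """Apply the correction pass to raw draws, where draws[i] is in range(n - i)."""
    sample = list(draws)
    for i in reversed(range(len(sample))):
        v = sample[i]
        # Entries before i are still raw here; walk them from latest to earliest
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample

def all_samples(n, k):
    """Corrected samples for every possible sequence of raw draws."""
    ranges = [range(n - i) for i in range(k)]
    return [tuple(shift_draws(d)) for d in product(*ranges)]

results = all_samples(5, 3)
# Every ordered 3-sample of range(5) should appear exactly once
assert sorted(results) == sorted(permutations(range(5), 3))
```

If each raw draw is uniform over its range, this one-to-one correspondence means every ordered sample is equally likely.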

This is a possible way to do it in TensorFlow (not sure if the best one):

import tensorflow as tf

def sample_wo_replacement_tf(n, k):
    # First loop
    sample = tf.constant([], dtype=tf.int64)
    i = 0
    sample, _ = tf.while_loop(
        lambda sample, i: i < k,
        # This is ugly but I did not want to define more functions
        lambda sample, i: (tf.concat([sample,
                                      tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                     axis=0),
                           i + 1),
        [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
    # Second loop
    def inner_loop(sample, i):
        sample_size = tf.shape(sample)[0]
        v = sample[i]
        j = i - 1
        v, _ = tf.while_loop(
            lambda v, j: j >= 0,
            lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
            [v, j])
        return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
    i = tf.shape(sample)[0] - 1
    sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
    return sample

And an example:

with tf.Graph().as_default(), tf.Session() as sess:
    tf.set_random_seed(100)
    sample = sample_wo_replacement_tf(10, 5)
    for i in range(10):
        print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]

This is quite intensive on tf.while_loops, though, which are well known not to be particularly fast in TensorFlow, so I wouldn't know how fast you can really get with this method without some kind of benchmarking.

EDIT 4:

One last possible method. You can divide the range of possible values (0 to n - 1) into "chunks" of size c, pick a random number of values from each chunk, and then shuffle everything. The memory used is bounded by c, and no nested loops are needed. If n is divisible by c, you should get a close to uniform distribution; otherwise, values in the last "short" chunk receive some extra probability (this may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long because it accounts for several corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts can be simplified.

import numpy as np

def sample_chunked(n, k, chunk=None):
    chunk = chunk or n
    last_chunk = chunk
    parts = n // chunk
    # Distribute k among chunks
    max_p = min(float(chunk) / k, 1.0)
    max_p_last = max_p
    if n % chunk != 0:
        parts += 1
        last_chunk = n % chunk
        max_p_last = min(float(last_chunk) / k, 1.0)
    p = np.full(parts, 2)
    # Iterate until a valid distribution is found
    while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
        p = np.random.uniform(size=parts)
        p /= np.sum(p)
    dist = (k * p).astype(np.int64)
    sample_size = np.sum(dist)
    # Account for rounding errors
    while sample_size < k:
        i = np.random.randint(len(dist))
        while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
            i = np.random.randint(len(dist))
        dist[i] += 1
        sample_size += 1
    while sample_size > k:
        i = np.random.randint(len(dist))
        while dist[i] == 0:
            i = np.random.randint(len(dist))
        dist[i] -= 1
        sample_size -= 1
    assert sample_size == k
    # Generate sample parts
    sample_parts = []
    for i, v in enumerate(np.nditer(dist)):
        if v <= 0:
            continue
        c = chunk if i < parts - 1 else last_chunk
        base = chunk * i
        sample_parts.append(base + np.random.choice(c, v, replace=False))
    sample = np.concatenate(sample_parts, axis=0)
    np.random.shuffle(sample)
    return sample

np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8  9 12 13  3]

In a quick benchmark, sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds on my computer, while I haven't been able to run the previous algorithm (the sample_wo_replacement function above) to completion with the same parameters. It should be possible to implement this in TensorFlow, maybe using tf.TensorArray, although it would require significant effort to get exactly right.
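
As a variation on the chunk idea, here is a pure Python sketch that swaps the random-proportions step for a different allocation: each of the k picks chooses a chunk with probability proportional to the number of values still unpicked in it. This is not the method used in sample_chunked; the name sample_chunked_exact and the rng parameter are made up, and the allocation loop costs O(k * number of chunks), so it is meant as an illustration only.

```python
import random

def sample_chunked_exact(n, k, chunk, rng=random):
    """Chunked sampling without replacement with capacity-weighted allocation."""
    if not 0 <= k <= n or chunk <= 0:
        raise ValueError("need 0 <= k <= n and chunk > 0")
    parts = -(-n // chunk)  # number of chunks, rounding up
    # Values still available in each chunk (the last chunk may be shorter)
    remaining = [min(chunk, n - i * chunk) for i in range(parts)]
    counts = [0] * parts
    for _ in range(k):
        # Choose a chunk with probability proportional to what is left in it
        i = rng.choices(range(parts), weights=remaining)[0]
        counts[i] += 1
        remaining[i] -= 1
    sample = []
    for i, count in enumerate(counts):
        size = min(chunk, n - i * chunk)
        # Distinct offsets within the chunk, moved to the chunk's base
        sample.extend(i * chunk + v for v in rng.sample(range(size), count))
    rng.shuffle(sample)
    return sample

print(sample_chunked_exact(15, 5, 4, random.Random(100)))
```

Because the chunk weights track how many values remain, a short last chunk does not receive extra probability in this variant.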

answered Oct 08 '22 07:10

jdehesa

Use the Gumbel-max trick described here: https://github.com/tensorflow/tensorflow/issues/9260

z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits), 0, 1)))
_, indices = tf.nn.top_k(logits + z, K)

indices are what you want. This trick is so easy!
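
To make the steps concrete, here is a small pure Python sketch of the same trick: add independent Gumbel noise to each logit and keep the indices of the k largest perturbed values. The function name gumbel_top_k is made up. For uniform sampling the logits can all be equal (for example all zeros); note that this still needs one score per candidate, so the memory used is O(n).

```python
import heapq
import math
import random

def gumbel_top_k(logits, k, rng=random):
    """Indices of the k largest values of logits + Gumbel noise."""
    def gumbel():
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        return -math.log(-math.log(u))
    # Perturb every logit with its own noise sample
    keys = [logit + gumbel() for logit in logits]
    # The indices of the k largest perturbed values are all distinct
    return heapq.nlargest(k, range(len(logits)), key=keys.__getitem__)

print(gumbel_top_k([0.0] * 100, 10, random.Random(0)))
```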

answered Oct 08 '22 06:10

Xiang Qi
