Output differences when changing order of batch(), shuffle() and repeat()

I have created a TensorFlow dataset, made it repeatable, shuffled it, divided it into batches, and constructed an iterator to get the next batch. But when I do this, the elements are sometimes repetitive (within and among batches), especially for small datasets. Why?

Miladiouss asked Apr 19 '18

People also ask

What does shuffle do in TensorFlow?

The Dataset.shuffle() method randomly shuffles the elements of a dataset. Parameters: buffer_size: the number of elements from which the new dataset will sample. seed (optional): seeds the shuffle's random number generator; to reproduce the same order, use the same seed.
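
A minimal TF 2.x sketch of shuffle() with the parameters described above (the 2.x eager API and as_numpy_iterator() are assumed):

import tensorflow as tf

# Toy dataset of 10 integers; a buffer covering the whole dataset
# gives a full shuffle, and a fixed seed makes it reproducible.
ds = tf.data.Dataset.range(10)
ds = ds.shuffle(buffer_size=10, seed=42)
print(list(ds.as_numpy_iterator()))  # a reproducible permutation of 0..9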

What does dataset repeat do?

Dataset.repeat(count=None): the method repeats the dataset count times. With the default (count=None, or count=-1), the dataset is repeated indefinitely.
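
A minimal sketch of repeat() (TF 2.x API assumed):

import tensorflow as tf

ds = tf.data.Dataset.range(3).repeat(2)  # repeat the data twice
print(list(ds.as_numpy_iterator()))      # [0, 1, 2, 0, 1, 2]
# With no count (the default), repeat() loops indefinitely.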

What is Buffer_size?

buffer_size: A tf.int64 scalar tf.Tensor, representing the maximum number of elements that will be buffered when prefetching.

What is TF data Autotune?

tf.data builds a performance model of the input pipeline and runs an optimization algorithm to find a good allocation of its CPU budget across all parameters specified as AUTOTUNE.
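
A sketch combining the two previous points, with both an explicit prefetch buffer and AUTOTUNE (tf.data.AUTOTUNE exists in TF >= 2.4; older versions use tf.data.experimental.AUTOTUNE):

import tensorflow as tf

ds = tf.data.Dataset.range(100).batch(10)

# Explicit buffer: keep up to 2 batches ready ahead of the consumer.
ds_manual = ds.prefetch(buffer_size=2)

# AUTOTUNE: let tf.data pick the buffer size dynamically at runtime.
ds_auto = ds.prefetch(tf.data.AUTOTUNE)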


2 Answers

Unlike what is stated in your own answer: no, shuffling and then repeating alone won't fix your problems.

The key source of your problem is that you batch, then shuffle/repeat. That way, the items in your batches will always be taken from contiguous samples in the input dataset. Batching should be one of the last operations you do in your input pipeline.
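
To see this concretely, here is a minimal sketch (TF 2.x eager API assumed; the answer below uses the 1.x iterator API): once you batch, a later shuffle can only permute whole, frozen batches.

import tensorflow as tf

# Batching first freezes contiguous pairs; shuffle() afterwards
# only reorders the batches, never the samples inside them.
ds = tf.data.Dataset.range(6).batch(2).shuffle(3)
print([b.tolist() for b in ds.as_numpy_iterator()])
# e.g. [[2, 3], [0, 1], [4, 5]] -- the pairs themselves never change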

Expanding the question slightly.

Now, there is a difference in the order in which you shuffle, repeat and batch, but it's not what you think. Quoting from the input pipeline performance guide:

If the repeat transformation is applied before the shuffle transformation, then the epoch boundaries are blurred. That is, certain elements can be repeated before other elements appear even once. On the other hand, if the shuffle transformation is applied before the repeat transformation, then performance might slow down at the beginning of each epoch related to initialization of the internal state of the shuffle transformation. In other words, the former (repeat before shuffle) provides better performance, while the latter (shuffle before repeat) provides stronger ordering guarantees.

Recapping

  • Repeat, then shuffle: you lose the guarantee that all samples are processed in one epoch.
  • Shuffle, then repeat: it is guaranteed that all samples will be processed before the next repeat begins, but you lose (slightly) in performance.

Whichever you choose, do that before batching.
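
As a sketch of the two options on a toy 4-element dataset (TF 2.x API assumed):

import tensorflow as tf

data = tf.data.Dataset.range(4)

# Shuffle before repeat: every epoch is a full permutation of 0..3,
# at the cost of re-initializing the shuffle buffer each epoch.
strong = data.shuffle(4).repeat(2).batch(2)

# Repeat before shuffle: the buffer mixes elements across epoch
# boundaries, so a sample can recur before another appears at all.
blurred = data.repeat(2).shuffle(4).batch(2)

for name, ds in [("shuffle-then-repeat", strong), ("repeat-then-shuffle", blurred)]:
    print(name, [b.tolist() for b in ds.as_numpy_iterator()])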

answered by GPhilo

You must shuffle first, and then repeat!

As the following snippets show, the order of shuffling, repeating, and batching matters.

Worst ordering:

import tensorflow as tf

ds = tf.data.Dataset.range(10)
ds = ds.batch(2)        # batch first: pairs are frozen as [0 1], [2 3], ...
ds = ds.repeat()        # repeat whole batches indefinitely
ds = ds.shuffle(100000) # shuffle whole batches across epoch boundaries
iterator   = ds.make_one_shot_iterator()  # TF 1.x API
next_batch = iterator.get_next()

with tf.Session() as sess:
    for i in range(15):
        if i % (10//2) == 0:  # separator every 5 batches (one epoch)
            print("------------")
        print("{:02d}:".format(i), next_batch.eval())

Output:

------------
00: [6 7]
01: [2 3]
02: [6 7]
03: [0 1]
04: [8 9]
------------
05: [6 7]
06: [4 5]
07: [6 7]
08: [4 5]
09: [0 1]
------------
10: [2 3]
11: [0 1]
12: [0 1]
13: [2 3]
14: [4 5]

Bad ordering:

import tensorflow as tf

ds = tf.data.Dataset.range(10)
ds = ds.batch(2)        # batch first: pairs are still frozen
ds = ds.shuffle(100000) # shuffle whole batches within an epoch
ds = ds.repeat()        # each epoch is a permutation of the same 5 batches
iterator   = ds.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for i in range(15):
        if i % (10//2) == 0:
            print("------------")
        print("{:02d}:".format(i), next_batch.eval())

Output:

------------
00: [4 5]
01: [6 7]
02: [8 9]
03: [0 1]
04: [2 3]
------------
05: [0 1]
06: [4 5]
07: [8 9]
08: [2 3]
09: [6 7]
------------
10: [0 1]
11: [4 5]
12: [8 9]
13: [2 3]
14: [6 7]

Best ordering:

Inspired by GPhilo's answer, the order of batching also matters. For batches to be different in each epoch, one must shuffle first, then repeat, and finally batch. As can be seen in the output, all batches are unique, unlike in the other orderings.

import tensorflow as tf

ds = tf.data.Dataset.range(10)

ds = ds.shuffle(100000) # shuffle individual samples first
ds = ds.repeat()        # then repeat across epochs
ds = ds.batch(2)        # batch last: pairs differ every epoch

iterator   = ds.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for i in range(15):
        if i % (10//2) == 0:
            print("------------")
        print("{:02d}:".format(i), next_batch.eval())

Output:

------------
00: [2 5]
01: [1 8]
02: [9 6]
03: [3 4]
04: [7 0]
------------
05: [4 3]
06: [0 2]
07: [1 9]
08: [6 5]
09: [8 7]
------------
10: [7 3]
11: [5 9]
12: [4 1]
13: [8 6]
14: [0 2]
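
For reference, make_one_shot_iterator and tf.Session belong to the TF 1.x API. In TF 2.x a dataset is directly iterable, so the best ordering above can be written as the following sketch (TF 2.x assumed):

import tensorflow as tf

ds = tf.data.Dataset.range(10)
ds = ds.shuffle(10)   # shuffle individual samples first
ds = ds.repeat()      # then repeat
ds = ds.batch(2)      # batch last

# In TF 2.x the dataset is directly iterable; no Session or iterator needed.
for i, batch in enumerate(ds.take(15)):
    if i % 5 == 0:
        print("------------")
    print("{:02d}:".format(i), batch.numpy())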

answered by Miladiouss