
TensorFlow DataSet `from_generator` with variable batch size

I'm trying to use the TensorFlow Dataset API to read an HDF5 file, using the from_generator method. Everything works fine unless the batch size does not evenly divide the number of events. I don't see how to make the batch size flexible using the API.

If things don't divide evenly, you get errors like:

2018-08-31 13:47:34.274303: W tensorflow/core/framework/op_kernel.cc:1263] Invalid argument: ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where an element of shape (11, 28, 28, 1) was expected.
Traceback (most recent call last):

  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)

  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 452, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))

ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where an element of shape (11, 28, 28, 1) was expected.

I have a script that reproduces the error (with instructions for getting the required data file, Fashion MNIST, which is several MB) here:

https://gist.github.com/gnperdue/b905a9c2dd4c08b53e0539d6aa3d3dc6

The most important code is probably:

def make_fashion_dset(file_name, batch_size, shuffle=False):
    dgen = _make_fashion_generator_fn(file_name, batch_size)
    features_shape = [batch_size, 28, 28, 1]
    labels_shape = [batch_size, 10]
    ds = tf.data.Dataset.from_generator(
        dgen, (tf.float32, tf.uint8),
        (tf.TensorShape(features_shape), tf.TensorShape(labels_shape))
    )
    ...

where dgen is a generator function reading from the HDF5 file:

def _make_fashion_generator_fn(file_name, batch_size):
    reader = FashionHDF5Reader(file_name)
    nevents = reader.openf()

    def example_generator_fn():
        start_idx, stop_idx = 0, batch_size
        while True:
            if start_idx >= nevents:
                reader.closef()
                return
            yield reader.get_examples(start_idx, stop_idx)
            start_idx, stop_idx = start_idx + batch_size, stop_idx + batch_size

    return example_generator_fn

The core of the problem is that we have to declare the tensor shapes in from_generator, but the shape of the final batch can differ from the declared one while iterating.
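To make the mismatch concrete, here is a minimal sketch of the batch sizes a fixed-stride loop like the one above produces. The helper names `batch_slices` and `batch_sizes` and the event count of 34 are hypothetical, chosen only so that one event is left over as in the traceback. It also assumes the reader clamps the stop index to the number of events, the way array slicing does.

```python
def batch_slices(nevents, batch_size):
    """Return the (start, stop) index pairs a fixed-stride batch loop visits."""
    return [
        (start, min(start + batch_size, nevents))  # clamp the last stop index
        for start in range(0, nevents, batch_size)
    ]


def batch_sizes(nevents, batch_size):
    """Return the number of events in each batch, in order."""
    return [stop - start for start, stop in batch_slices(nevents, batch_size)]


# Hypothetical counts: 34 events do not divide evenly into batches of 11.
sizes = batch_sizes(nevents=34, batch_size=11)
print(sizes)  # the final entry is the leftover batch
```

Any batch whose size differs from the declared leading dimension triggers the ValueError, so only the last batch fails when the shapes are fully static.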

There are some workarounds: drop the last few samples so the division is even, or just use a batch size of 1. But the first is bad if you can't lose any samples, and a batch size of 1 is very slow.

Any ideas or comments? Thanks!

Gabriel Perdue asked Feb 03 '23 22:02


1 Answer

When specifying Tensor shapes in from_generator, you can use None as an element to mark a dimension as variable-sized. This way you can accommodate batches of different sizes, in particular "leftover" batches that are smaller than your requested batch size. So you would use:

def make_fashion_dset(file_name, batch_size, shuffle=False):
    dgen = _make_fashion_generator_fn(file_name, batch_size)
    features_shape = [None, 28, 28, 1]
    labels_shape = [None, 10]
    ds = tf.data.Dataset.from_generator(
        dgen, (tf.float32, tf.uint8),
        (tf.TensorShape(features_shape), tf.TensorShape(labels_shape))
    )
    ...
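
As a rough, self-contained illustration of the same idea, the sketch below swaps the HDF5 reader for in-memory NumPy arrays so it can run without the data file. The names `make_batch_generator_fn` and `make_dset` and the 25-event array are hypothetical stand-ins, not part of the original script. It assumes TensorFlow 2.x with eager execution for the iteration at the end; under TensorFlow 1.x you would need an iterator and a session instead. Newer TensorFlow releases also recommend `output_signature` with `tf.TensorSpec` over the `output_types` and `output_shapes` arguments used here.

```python
import numpy as np
import tensorflow as tf


def make_batch_generator_fn(features, labels, batch_size):
    """Return a generator function yielding consecutive batches (hypothetical helper)."""
    nevents = len(features)

    def generator_fn():
        for start in range(0, nevents, batch_size):
            stop = start + batch_size  # NumPy slicing clamps this to nevents
            yield features[start:stop], labels[start:stop]

    return generator_fn


def make_dset(features, labels, batch_size):
    """Build a dataset whose leading (batch) dimension is left unspecified."""
    gen_fn = make_batch_generator_fn(features, labels, batch_size)
    return tf.data.Dataset.from_generator(
        gen_fn,
        output_types=(tf.float32, tf.uint8),
        output_shapes=(
            tf.TensorShape([None, 28, 28, 1]),  # None accepts any batch size
            tf.TensorShape([None, 10]),
        ),
    )


# Dummy data: 25 events, which do not divide evenly into batches of 11.
features = np.zeros((25, 28, 28, 1), dtype=np.float32)
labels = np.zeros((25, 10), dtype=np.uint8)
ds = make_dset(features, labels, batch_size=11)

for batch_features, batch_labels in ds:
    print(batch_features.shape, batch_labels.shape)
```

The remaining dimensions (28, 28, 1 and 10) are still fixed, so only the batch size is allowed to vary between elements.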

xdurch0 answered Feb 06 '23 11:02