 

How to properly use a TensorFlow Dataset with batching?

I am new to TensorFlow and deep learning, and I am struggling with the Dataset class. I have tried a lot of things and I can't find a good solution.

What I am trying

I have a large number of images (500k+) to train my DNN with. Since this is a denoising autoencoder, I have a pair for each image. I am using the TF Dataset class to manage the data, but I think I am using it badly.

Here is how I load the filenames in a dataset:

class Data:
    def __init__(self, in_path, out_path):
        self.nb_images = 512
        self.test_ratio = 0.2
        self.batch_size = 8

        # load filenames in input and outputs
        inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)

        self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
        self.size_test = int(self.nb_images * self.test_ratio)

        # split arrays in training / validation
        test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
        test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)

        # transform array to tf.data.Dataset
        self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
        self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))

I have a function that is called at each epoch to prepare the dataset: it shuffles the filenames, maps them to images, and batches the data.

def get_batched_data(self, seed, batch_size):
    nb_batch = int(self.size_training / batch_size)

    def img_to_tensor(path_in, path_out):
        img_string_in = tf.read_file(path_in)
        img_string_out = tf.read_file(path_out)
        im_in = tf.image.decode_jpeg(img_string_in, channels=1)
        im_out = tf.image.decode_jpeg(img_string_out, channels=1)
        return im_in, im_out

    t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
    t_datas = t_datas.map(img_to_tensor)
    t_datas = t_datas.batch(batch_size)
    return t_datas

During training, at each epoch we call the get_batched_data function, make an iterator, run it for each batch, and feed the resulting arrays to the optimizer operation.

for epoch in range(nb_epoch):
    sess_iter_in = tf.Session()
    sess_iter_out = tf.Session()

    batched_train = data.get_batched_data(seed=epoch, batch_size=batch_size)
    iterator_train = batched_train.make_one_shot_iterator()
    in_data, out_data = iterator_train.get_next()

    total_batch = int(data.size_training / batch_size)
    for batch in range(total_batch):
        print(f"{batch + 1} / {total_batch}")
        in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
        out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
        sess.run(optimizer, feed_dict={inputs: in_images,
                                       outputs: out_images})

What do I need?

I need a pipeline that loads only the images of the current batch (otherwise the data will not fit in memory), and I want the dataset to be shuffled differently at each epoch.

Questions and problems

First question: am I using the Dataset class the right way? I have seen very different approaches on the internet; for example, in this blog post the dataset is used with a placeholder that is fed with the data during training. That seems strange, because the data are all in an array and therefore already loaded in memory; I don't see the point of using tf.data.Dataset in that case.

I found a solution using repeat(epoch) on the dataset, like this, but then the shuffle is not different for each epoch.

The second problem with my implementation is that I get an OutOfRangeError in some cases. With a small amount of data (512, as in the example) it works fine, but with more data the error occurs. I thought it was caused by a miscalculation of the number of batches due to rounding, or by the last batch containing fewer samples, but it happens at batch 32 out of 115... Is there any way to know the number of batches created after a batch(n) call on a dataset?

Sorry for this loooonng question, but I've been struggling with this for a few days.

asked Oct 19 '18 by petit_w


1 Answer

As far as I know, the official Performance Guide is the best teaching material for building input pipelines.

I want to shuffle the dataset in a different way for each epoch.

Using shuffle() and repeat(), you can get a different shuffle pattern for each epoch. You can confirm it with the following code:

dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)

iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(x))
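
Note that the order matters here: shuffle() before repeat() reshuffles independently for each epoch while keeping epoch boundaries intact, whereas repeat() before shuffle() can mix samples from neighbouring epochs.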

You can also use tf.contrib.data.shuffle_and_repeat, as mentioned on the official page above.
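
For example, the fused transformation is applied with apply() (a minimal sketch; the buffer size and repeat count are illustrative):

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
# fuses shuffle() and repeat(), reshuffling at each repetition
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=4, count=3))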

There are some problems in your code beyond the data pipeline itself. You are confusing graph construction with graph execution: you re-create the input pipeline at every epoch, so you end up with as many redundant pipelines as there are epochs. You can observe these redundant pipelines in TensorBoard.

You should place your graph-construction code outside the loop, as in the following (pseudo) code:

batched_train = data.get_batched_data()
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()

for epoch in range(nb_epoch):
    # reset the iterator's state (shuffle() reshuffles on each new iteration)
    sess.run(iterator.initializer)

    try:
        while True:
            # fetch both tensors in a single run call so the iterator advances
            # only once and the input/output pair stays matched
            in_images, out_images = sess.run([in_data, out_data])
            in_images = in_images.reshape((-1, 64, 64, 1))
            out_images = out_images.reshape((-1, 64, 64, 1))
            sess.run(optimizer, feed_dict={inputs: in_images,
                                           outputs: out_images})
    except tf.errors.OutOfRangeError:
        pass

Moreover, there is some minor inefficiency in your code: you loaded the list of file paths with from_tensor_slices(), so the list is embedded in your graph as constants. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for details.)
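
If the list of paths grows large, the pattern from that guide is to feed the arrays through placeholders when the iterator is initialized, so they are not baked into the graph definition. A minimal sketch, reusing the names from your question (the placeholder names are illustrative):

paths_in_ph = tf.placeholder(tf.string, shape=[None])
paths_out_ph = tf.placeholder(tf.string, shape=[None])

dataset = tf.data.Dataset.from_tensor_slices((paths_in_ph, paths_out_ph))
dataset = dataset.shuffle(size_training).map(img_to_tensor).batch(batch_size)

iterator = dataset.make_initializable_iterator()
# the path lists are fed here instead of being embedded as graph constants
sess.run(iterator.initializer, feed_dict={paths_in_ph: training_data_in,
                                          paths_out_ph: training_data_out})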

You would be better off using prefetch() and reducing the number of sess.run calls by combining parts of your graph.
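
A sketch of both ideas together (build_model is a hypothetical stand-in for however you construct the training op):

batched_train = batched_train.prefetch(1)  # prepare the next batch while the current one trains
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()

# building the model directly on the dataset tensors removes the feed_dict
# round-trip, so a single run call pulls a batch and applies the update
train_op = build_model(in_data, out_data)
sess.run(train_op)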

answered Sep 26 '22 by saket