How to use parallel_interleave in TensorFlow

I am reading the code in the TensorFlow benchmarks repo. The following piece of code is the part that creates a TensorFlow dataset from TFRecord files:

ds = tf.data.TFRecordDataset.list_files(tfrecord_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(tf.data.TFRecordDataset, cycle_length=10))

I am trying to change this code to create dataset directly from JPEG image files:

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(?, cycle_length=10))

I don't know what to write in the ? place. For TFRecord files, the map_func in parallel_interleave() is the constructor of the tf.data.TFRecordDataset class, but I don't know what to write for JPEG files.

We don't need to do any transformations here, because we will zip the two datasets and do the transformations later. The code is as follows:

counter = tf.data.Dataset.range(batch_size)
ds = tf.data.Dataset.zip((ds, counter))
ds = ds.apply(batching.map_and_batch(
    map_func=preprocess_fn,
    batch_size=batch_size,
    num_parallel_batches=num_splits))

Because we don't need a transformation in the ? place, I tried to use an empty map_func, but I get the error "map_func must return a Dataset object". I also tried to use tf.data.Dataset itself, but the output says Dataset is an abstract class and is not allowed there.

Can anyone help with this? Thanks very much.

asked Apr 26 '18 by silence_lamb

1 Answer

parallel_interleave is useful when you have a transformation that turns each element of a source dataset into multiple elements in the destination dataset. I'm not sure why they use it that way in the benchmarks repo, when they could have just used a map with parallel calls.
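For intuition, parallel_interleave(f, ...) behaves like flat_map(f), except that several input elements are processed concurrently and their outputs interleaved. A sequential sketch of the same TFRecord pipeline from the question would be:

ds = tf.data.TFRecordDataset.list_files(tfrecord_file_names)
ds = ds.flat_map(tf.data.TFRecordDataset)  # sequential equivalent: one file at a time, no interleaving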

Here's how I suggest using parallel_interleave for reading images from several directories, each containing one class:

import numpy as np
import tensorflow as tf
from glob import glob

DS = tf.data.Dataset  # shorthand used below

classes = sorted(glob(directory + '/*/'))  # final slash selects directories only
num_classes = len(classes)

labels = np.arange(num_classes, dtype=np.int32)

dirs = DS.from_tensor_slices((classes, labels))               # 1
files = dirs.apply(tf.contrib.data.parallel_interleave(
    get_files, cycle_length=num_classes, block_length=4,      # 2
    sloppy=False))  # False is important! Otherwise it mixes labels
files = files.cache()
imgs = (files.map(read_decode, num_parallel_calls=20)         # 3
             .apply(tf.contrib.data.shuffle_and_repeat(100))
             .batch(batch_size)
             .prefetch(5))

There are three steps. First, we get the list of directories and their labels (#1).

Then, we map these to a dataset of files. But if we do a simple .flat_map(), we will end up with all the files of label 0, followed by all the files of label 1, then 2, etc. We would then need really large shuffle buffers to get a meaningful shuffle, as the sketch below illustrates.
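A toy sketch of the problem (using the get_files helper defined further down):

files_seq = dirs.flat_map(get_files)
# yields all files of class 0, then all of class 1, then class 2, ...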

So, instead, we apply parallel_interleave (#2). Here is get_files():

def get_files(dir_path, label):
    globbed = tf.string_join([dir_path, '*.jpg'])  # glob pattern for this class directory
    files = tf.matching_files(globbed)             # list the matching files

    num_files = tf.shape(files)[0]                 # number of files in the directory
    labels = tf.tile([label], [num_files])         # repeat the label once per file
    return DS.from_tensor_slices((files, labels))

Using parallel_interleave ensures the file listing of each directory runs in parallel, so by the time the first block_length files are listed from the first directory, the first block_length files from the 2nd directory will also be available (and from the 3rd, 4th, etc.). Moreover, the resulting dataset will contain interleaved blocks of each label, e.g. 1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 ... (for 3 classes and block_length=4).
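To see the block pattern in isolation, here is a toy sketch of my own (not part of the pipeline above) with three repeating inner datasets:

toy = tf.data.Dataset.from_tensor_slices([1, 2, 3])
toy = toy.apply(tf.contrib.data.parallel_interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(8),
    cycle_length=3, block_length=4, sloppy=False))
# elements come out as 1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 ...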

Finally, we read the images from the list of files (#3). Here is read_decode():

def read_decode(path, label):
    img = tf.image.decode_image(tf.read_file(path), channels=3)
    img = tf.image.resize_bilinear(tf.expand_dims(img, axis=0), target_size)
    img = tf.squeeze(img, 0)
    img = preprocess_fct(img) # should work with Tensors !

    label = tf.one_hot(label, num_classes)
    img = tf.Print(img, [path, label], 'Read_decode')  # debug: log path and label
    return (img, label)

This function takes an image path and its label and returns a tensor for each: an image tensor for the path and a one-hot encoding for the label. This is also the place where you can do all the transformations on the image; here, I do resizing and basic pre-processing.
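To actually pull batches out of imgs, a minimal TF 1.x-style consumption sketch (assuming the pipeline defined above) is:

iterator = imgs.make_one_shot_iterator()
img_batch, label_batch = iterator.get_next()

with tf.Session() as sess:
    images, labels_np = sess.run([img_batch, label_batch])  # one batch of images and one-hot labels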

answered Sep 30 '22 by Ciprian Tomoiagă