I am reading the code in the TensorFlow benchmarks repo. The following piece of code is the part that creates a TensorFlow dataset from TFRecord files:
ds = tf.data.TFRecordDataset.list_files(tfrecord_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(tf.data.TFRecordDataset, cycle_length=10))
I am trying to change this code to create a dataset directly from JPEG image files:
ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(?, cycle_length=10))
I don't know what to put in place of the ?. For TFRecord files, the map_func of parallel_interleave() is simply the tf.data.TFRecordDataset constructor, but I don't know what to use for JPEG files.
We don't need to do any transformations here, because we will zip two datasets and then do the transformations later. The code is as follows:
counter = tf.data.Dataset.range(batch_size)
ds = tf.data.Dataset.zip((ds, counter))
ds = ds.apply(
    batching.map_and_batch(
        map_func=preprocess_fn,
        batch_size=batch_size,
        num_parallel_batches=num_splits))
Because we don't need a transformation in the ? place, I tried to use an empty map_func, but I get the error "map_func must return a Dataset object". I also tried passing tf.data.Dataset itself, but the error says Dataset is an abstract class and cannot be used there.
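For context, this is roughly what I understand a valid map_func has to look like — just my guess at a minimal placeholder (filename_to_dataset is a name I made up, not something from the benchmarks repo):
# My guess at a minimal map_func: it must take a filename tensor and return a Dataset.
def filename_to_dataset(filename):
    return tf.data.Dataset.from_tensors(filename)

ds = ds.apply(interleave_ops.parallel_interleave(filename_to_dataset, cycle_length=10))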
Can anyone help with this? Thanks very much.
parallel_interleave is useful when you have a transformation that transforms each element of a source dataset into multiple elements in the destination dataset. I'm not sure why they use it in the benchmarks repo like that, when they could have just used a map with parallel calls.
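For reference, here is a sketch of what I mean by a plain map with parallel calls (decode_jpeg_fn and jpeg_file_names are placeholder names, not taken from the benchmarks code):
# Sketch of the simpler alternative: decode each JPEG with a parallel map.
def decode_jpeg_fn(path):
    return tf.image.decode_jpeg(tf.read_file(path), channels=3)

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.map(decode_jpeg_fn, num_parallel_calls=10)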
Here's how I suggest using parallel_interleave for reading images from several directories, each containing one class:
from glob import glob

import numpy as np
import tensorflow as tf

DS = tf.data.Dataset  # shorthand

classes = sorted(glob(directory + '/*/'))  # final slash selects directories only
num_classes = len(classes)
labels = np.arange(num_classes, dtype=np.int32)

dirs = DS.from_tensor_slices((classes, labels))              # 1
files = dirs.apply(tf.contrib.data.parallel_interleave(
    get_files, cycle_length=num_classes, block_length=4,     # 2
    sloppy=False))  # False is important! Otherwise it mixes labels
files = files.cache()
imgs = (files.map(read_decode, num_parallel_calls=20)        # 3
             .apply(tf.contrib.data.shuffle_and_repeat(100))
             .batch(batch_size)
             .prefetch(5))
There are three steps. First, we get the list of directories and their labels (#1).
Then, we map these to a dataset of files. But if we did a simple .flat_map(), we would end up with all the files of label 0, followed by all the files of label 1, then 2, etc. We'd then need really large shuffle buffers to get a meaningful shuffle.
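For comparison, the flat_map version I'm arguing against would look something like this (a sketch, not part of the pipeline above):
# Sketch of the flat_map alternative: emits every file of class 0, then every
# file of class 1, and so on, which makes shuffling expensive.
files_unshuffled = dirs.flat_map(get_files)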
So, instead, we apply parallel_interleave (#2). Here is get_files():
def get_files(dir_path, label):
    globbed = tf.string_join([dir_path, '*.jpg'])
    files = tf.matching_files(globbed)
    num_files = tf.shape(files)[0]          # number of files in the directory
    labels = tf.tile([label], [num_files])  # expand the label to all files
    return DS.from_tensor_slices((files, labels))
Using parallel_interleave ensures the file listing of each directory runs in parallel, so by the time the first block_length files are listed from the first directory, the first block_length files from the 2nd directory will also be available (and from the 3rd, 4th, etc.). Moreover, the resulting dataset will contain interleaved blocks of each label, e.g. 1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 ... (for 3 classes and block_length=4).
Finally, we read the images from the list of files (#3). Here is read_decode():
def read_decode(path, label):
    img = tf.image.decode_image(tf.read_file(path), channels=3)
    img = tf.image.resize_bilinear(tf.expand_dims(img, axis=0), target_size)
    img = tf.squeeze(img, 0)
    img = preprocess_fct(img)  # should work with Tensors!
    label = tf.one_hot(label, num_classes)
    img = tf.Print(img, [path, label], 'Read_decode')
    return (img, label)
This function takes an image path and its label and returns a tensor for each: an image tensor for the path and a one-hot encoding for the label. This is also the place where you can do all the transformations on the image; here, I do resizing and basic pre-processing.
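To actually pull batches out of this pipeline, a minimal consumption loop (TF 1.x style; this part is just a sketch, assuming imgs is built as above) would be:
# Sketch: consume the dataset with a one-shot iterator and a session.
iterator = imgs.make_one_shot_iterator()
img_batch, label_batch = iterator.get_next()

with tf.Session() as sess:
    images, batch_labels = sess.run([img_batch, label_batch])
    print(images.shape, batch_labels.shape)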