My goal is to efficiently read data from a remote filesystem (e.g. HDFS). With the TensorFlow dataset API, I can either follow the guide here and use parallel_interleave to read from different files on the remote host, like so:
def input_fn():
    files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
    dataset = files.apply(
        tf.data.experimental.parallel_interleave(
            lambda filename: tf.data.TFRecordDataset(filename),
            cycle_length=4))
    dataset = dataset.map(map_func=parse_fn)
    dataset = dataset.batch(batch_size=FLAGS.batch_size)
    dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
    return dataset
Or I can use num_parallel_reads, link, to read from different files on the remote host, like so:
def input_fn():
    files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=4)
    dataset = dataset.map(map_func=parse_fn)
    dataset = dataset.batch(batch_size=FLAGS.batch_size)
    dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
    return dataset
I assume they would both serve the same purpose: four threads on my CPU fetch data from four different files, giving better throughput than reading a single file. Is there any difference between the two approaches in this case?
I also assume that the first method reads from a different file for each record in a batch, more like a breadth-first traversal of my remote files, while the second approach is more like a depth-first traversal. For a local filesystem with low latency it may not matter, but for a remote one like HDFS, which should be the preferred way to go?
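To make the assumption concrete, here is a hypothetical pure-Python sketch (no TensorFlow involved) contrasting the two orderings I have in mind. The file names and record contents are made up for illustration:

```python
# Two toy "files", each holding a list of records.
files = {
    "train-0.tfrecord": ["a0", "a1", "a2"],
    "train-1.tfrecord": ["b0", "b1", "b2"],
}

def depth_first(files):
    # Read each file to completion before moving on to the next.
    for records in files.values():
        yield from records

def breadth_first(files, cycle_length=2):
    # Round-robin one record at a time across `cycle_length` open files.
    iters = [iter(r) for r in list(files.values())[:cycle_length]]
    while iters:
        for it in list(iters):
            try:
                yield next(it)
            except StopIteration:
                iters.remove(it)

print(list(depth_first(files)))    # ['a0', 'a1', 'a2', 'b0', 'b1', 'b2']
print(list(breadth_first(files)))  # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```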
I just went through the source code of both TFRecordDataset and parallel_interleave. Note that I am looking at tf.data.experimental, as the tf.contrib.data one is deprecated. Funnily enough, they both end up calling the same class, ParallelInterleaveDataset, to do the parallel reading. So it comes down to how finely you can tune your pipeline: with parallel_interleave you can use parameters like block_length, sloppy, buffer_output_elements and prefetch_input_elements to potentially speed up your pipeline, while also imparting some randomness in ordering.
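As a rough illustration of what one of those knobs does, here is a hypothetical pure-Python simulation of how cycle_length and block_length interact in an interleave: block_length consecutive records are pulled from one input before rotating to the next. The source data here is invented, and this only models ordering, not the actual threaded implementation:

```python
def interleave(sources, cycle_length, block_length):
    # Rotate across up to `cycle_length` inputs, taking `block_length`
    # records from each input per visit.
    iters = [iter(s) for s in sources[:cycle_length]]
    while iters:
        for it in list(iters):
            for _ in range(block_length):
                try:
                    yield next(it)
                except StopIteration:
                    iters.remove(it)
                    break

sources = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
print(list(interleave(sources, cycle_length=2, block_length=1)))
# ['a0', 'b0', 'a1', 'b1', 'a2', 'b2', 'a3', 'b3']
print(list(interleave(sources, cycle_length=2, block_length=2)))
# ['a0', 'a1', 'b0', 'b1', 'a2', 'a3', 'b2', 'b3']
```

With sloppy=True the real op additionally relaxes this deterministic order, letting whichever input produces data first be consumed first, which is where the randomness comes from.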