
Which is better when reading from remote hosts like HDFS: TFRecordDataset + num_parallel_reads, or parallel_interleave?

The goal is to efficiently read data from a remote host (e.g. HDFS). With the TensorFlow Dataset API, I can either follow the guide here and use parallel_interleave to read from different files on the remote host, like so

def input_fn():
  files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
  dataset = files.apply(
      tf.data.experimental.parallel_interleave(
          lambda filename: tf.data.TFRecordDataset(filename),
          cycle_length=4))
  dataset = dataset.map(map_func=parse_fn)
  dataset = dataset.batch(batch_size=FLAGS.batch_size)
  dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  return dataset

Or I can use num_parallel_reads, link, to read from different files on the remote host, like so

def input_fn():
  files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
  dataset = tf.data.TFRecordDataset(files, num_parallel_reads=4)
  dataset = dataset.map(map_func=parse_fn)
  dataset = dataset.batch(batch_size=FLAGS.batch_size)
  dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  return dataset

I assume they both serve the same purpose: 4 of my CPU's threads fetch data from 4 different files, giving better throughput than reading a single file. Is there any difference between the two approaches in this case?

I also assume the first method reads from different files within each batch, more like a breadth-first traversal of my remote files, while the second approach is more like a depth-first traversal. On a local filesystem with low latency it may not matter, but for a remote host like HDFS, which way is preferred?
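For intuition, here is a toy sketch of the ordering difference I have in mind (using small in-memory datasets as stand-ins for the remote files, not my actual HDFS pipeline):

import tensorflow as tf

def toy_file(start):
  # Each "file" is a tiny dataset of 4 consecutive records.
  return tf.data.Dataset.from_tensor_slices(tf.range(start, start + 4))

sources = tf.data.Dataset.from_tensor_slices([0, 100])

# Interleaved ("breadth-first") order: 0, 1, 100, 101, 2, 3, 102, 103
interleaved = sources.interleave(toy_file, cycle_length=2, block_length=2)

# One-file-at-a-time ("depth-first") order: 0, 1, 2, 3, 100, 101, 102, 103
sequential = sources.flat_map(toy_file)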

Kevin Yen asked Oct 27 '25


1 Answer

I just went through the source code of both TFRecordDataset and parallel_interleave. Note that I am looking at tf.data.experimental, as the tf.contrib.data one is deprecated. Funnily enough, they both call on the same class, ParallelInterleaveDataset, to do the parallel reading. So it comes down to which one lets you optimize your pipeline better: with parallel_interleave you get extra parameters such as block_length, sloppy, buffer_output_elements and prefetch_input_elements, which can potentially speed up your pipeline while also introducing some randomness in the output ordering.
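For reference, a sketch of what tuning those knobs might look like, based on the question's first snippet (the parameter values are illustrative only, not recommendations):

def input_fn():
  files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
  dataset = files.apply(
      tf.data.experimental.parallel_interleave(
          lambda filename: tf.data.TFRecordDataset(filename),
          cycle_length=4,              # number of files read concurrently
          block_length=16,             # consecutive records pulled from each file per turn
          sloppy=True,                 # allow out-of-order output for extra throughput
          buffer_output_elements=16,   # records buffered per open file
          prefetch_input_elements=4))  # input filenames prepared ahead of the readers
  dataset = dataset.map(map_func=parse_fn)
  dataset = dataset.batch(batch_size=FLAGS.batch_size)
  dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  return dataset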

kvish answered Oct 30 '25