 

Difference between tf.data.Dataset.map() and tf.data.Dataset.apply()

With the recent upgrade to version 1.4, TensorFlow included tf.data in the library core. One "major new feature" described in the version 1.4 release notes is tf.data.Dataset.apply(), which is a "method for applying custom transformation functions". How is this different from the already existing tf.data.Dataset.map()?

asked Nov 03 '17 by GPhilo



2 Answers

The difference is that map will execute one function on every element of the Dataset separately, whereas apply will execute one function on the whole Dataset at once (such as group_by_window, which is given as an example in the documentation).

The argument of apply is a function that takes a Dataset and returns a Dataset, whereas the argument of map is a function that takes one element and returns one transformed element.
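To make the distinction concrete, here is a minimal sketch (the function name take_even_then_batch and the specific transformations are illustrative, not from the original answer):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# map(): the function receives ONE element at a time and returns one transformed element.
doubled = dataset.map(lambda x: x * 2)

# apply(): the function receives the WHOLE Dataset and returns a new Dataset.
def take_even_then_batch(ds):
    # Any Dataset-to-Dataset transformation can go here.
    return ds.filter(lambda x: tf.equal(x % 2, 0)).batch(2)

batched_evens = dataset.apply(take_even_then_batch)

Because map never sees more than one element, anything that needs to look across elements (batching, grouping, windowing) has to go through apply.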

answered Sep 22 '22 by Sunreef


Sunreef's answer is absolutely correct. You might still be wondering why we introduced Dataset.apply(), and I thought I'd offer some background.

The tf.data API has a set of core transformations—like Dataset.map() and Dataset.filter()—that are generally useful across a wide range of datasets, unlikely to change, and implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backwards compatibility guarantees as other core APIs in TensorFlow.

However, the core approach is a bit restrictive. We also want the freedom to experiment with new transformations before adding them to the core, and to allow other library developers to create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (like tf.contrib.data.sloppy_interleave()), and some where the API is still in flux (like tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example:

dataset = tf.data.TFRecordDataset(...).map(...)

# Method chaining breaks when we apply a custom transformation.
dataset = custom_transformation(dataset, x, y, z)

dataset = dataset.shuffle(...).repeat(...).batch(...)

Since this seemed to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:

dataset = (tf.data.TFRecordDataset(...)
           .map(...)
           .apply(custom_transformation(x, y, z))
           .shuffle(...)
           .repeat(...)
           .batch(...))

It's a minor feature in the grand scheme of things, but hopefully it helps to make tf.data programs easier to read, and the library easier to extend.
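As an illustration of that extension point, a custom transformation is typically written as a factory that returns a Dataset-to-Dataset function, so the result can be passed straight to apply(). This is only a sketch; the name shuffle_and_batch is hypothetical and not part of tf.data:

import tensorflow as tf

def shuffle_and_batch(buffer_size, batch_size):
    # Returns a function from Dataset to Dataset, suitable for Dataset.apply().
    def _apply_fn(dataset):
        return dataset.shuffle(buffer_size).batch(batch_size)
    return _apply_fn

# The custom transformation now chains like any core method:
dataset = (tf.data.Dataset.range(100)
           .apply(shuffle_and_batch(buffer_size=10, batch_size=4)))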

answered Sep 24 '22 by mrry