tf.data.Dataset.padded_batch pad differently each feature

I have a tf.data.Dataset instance which holds 3 different features

  • label which is a scalar
  • sequence_feature which is a sequence of scalars
  • seq_of_seqs_feature which is a sequence of sequences

I am trying to use tf.data.Dataset.padded_batch() to generate padded batches as input to my model, and I want to pad each feature differently.

Example batch:

[{'label': 24,
  'sequence_feature': [1, 2],
  'seq_of_seqs_feature': [[11.1, 22.2],
                          [33.3, 44.4]]},
 {'label': 32,
  'sequence_feature': [3, 4, 5],
  'seq_of_seqs_feature': [[55.55, 66.66]]}]

Expected output:

[{'label': 24,
  'sequence_feature': [1, 2, 0],
  'seq_of_seqs_feature': [[11.1, 22.2],
                          [33.3, 44.4]]},
 {'label': 32,
  'sequence_feature': [3, 4, 5],
  'seq_of_seqs_feature': [[55.55, 66.66],
                          [ 0.0,   0.0 ]]}]

As you can see, the label feature should not be padded, while sequence_feature and seq_of_seqs_feature should each be padded to the length of the longest corresponding entry in the given batch.
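
For reference, a dataset with this element structure could be built as in the following minimal TF 2.x-style sketch; the from_generator construction, the dtypes, and the output_signature are assumptions for illustration and not part of the original question:

ds = None  # placeholder; defined below

import tensorflow as tf

def gen():
    # The two example elements from the batch shown above.
    yield {'label': 24,
           'sequence_feature': [1, 2],
           'seq_of_seqs_feature': [[11.1, 22.2], [33.3, 44.4]]}
    yield {'label': 32,
           'sequence_feature': [3, 4, 5],
           'seq_of_seqs_feature': [[55.55, 66.66]]}

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature={
        # Scalar label, variable-length vector, variable-shape matrix.
        'label': tf.TensorSpec(shape=[], dtype=tf.int32),
        'sequence_feature': tf.TensorSpec(shape=[None], dtype=tf.int32),
        'seq_of_seqs_feature': tf.TensorSpec(shape=[None, None], dtype=tf.float32),
    })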

asked Apr 15 '18 by bluesummers


1 Answer

The tf.data.Dataset.padded_batch() method allows you to specify padded_shapes for each component (feature) of the resulting batch. For example, if your input dataset is called ds:

padded_ds = ds.padded_batch(
    BATCH_SIZE,
    padded_shapes={
        'label': [],                          # Scalar elements, no padding.
        'sequence_feature': [None],           # Vector elements, padded to longest.
        'seq_of_seqs_feature': [None, None],  # Matrix elements, padded to longest
    })                                        # in each dimension.

Notice that the padded_shapes argument has the same structure as your input dataset's elements, so in this case it takes a dictionary with keys that match your feature names.
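
Continuing the sketch from the question above (ds built with from_generator; a batch size of 2 is an arbitrary choice for this illustration), iterating over the padded dataset shows the per-feature padding:

padded_ds = ds.padded_batch(
    2,
    padded_shapes={
        'label': [],                          # Scalars: no padding.
        'sequence_feature': [None],           # Padded to the longest sequence.
        'seq_of_seqs_feature': [None, None],  # Padded in both inner dimensions.
    })

for batch in padded_ds:
    print(batch['label'].shape)                # (2,)      -- unpadded scalars
    print(batch['sequence_feature'].shape)     # (2, 3)    -- padded to length 3 with 0
    print(batch['seq_of_seqs_feature'].shape)  # (2, 2, 2) -- second matrix zero-padded to 2x2

The pad value defaults to 0 for numeric types; if you need a different fill value per feature, you can pass a similarly structured padding_values argument to padded_batch().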

answered Sep 22 '22 by mrry