I have a tf.data.Dataset instance which holds 3 different features:

- label, which is a scalar
- sequence_feature, which is a sequence of scalars
- seq_of_seqs_feature, which is a sequence of sequences

I am trying to use tf.data.Dataset.padded_batch() to generate padded data as input to my model, and I want to pad every feature differently.
Example batch:
[{'label': 24,
'sequence_feature': [1, 2],
'seq_of_seqs_feature': [[11.1, 22.2],
[33.3, 44.4]]},
{'label': 32,
'sequence_feature': [3, 4, 5],
'seq_of_seqs_feature': [[55.55, 66.66]]}]
Expected output:
[{'label': 24,
'sequence_feature': [1, 2, 0],
'seq_of_seqs_feature': [[11.1, 22.2],
[33.3, 44.4]]},
{'label': 32,
'sequence_feature': [3, 4, 5],
'seq_of_seqs_feature': [[55.55, 66.66],
[ 0.0,   0.0 ]]}]
As you can see, the label feature should not be padded, while sequence_feature and seq_of_seqs_feature should each be padded to the length of the longest corresponding entry in the given batch.
The tf.data.Dataset.padded_batch() method allows you to specify padded_shapes for each component (feature) of the resulting batch. For example, if your input dataset is called ds:
padded_ds = ds.padded_batch(
BATCH_SIZE,
padded_shapes={
'label': [], # Scalar elements, no padding.
'sequence_feature': [None], # Vector elements, padded to longest.
'seq_of_seqs_feature': [None, None], # Matrix elements, padded to longest
}) # in each dimension.
Notice that the padded_shapes
argument has the same structure as your input dataset's elements, so in this case it takes a dictionary with keys that match your feature names.
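To make this concrete, here is a minimal end-to-end sketch of the scenario above. It reconstructs the example dataset with tf.data.Dataset.from_generator (an assumption — the question does not say how the dataset was built) and applies the padded_batch call shown in the answer:

```python
import tensorflow as tf

# The two example elements from the question.
examples = [
    {'label': 24,
     'sequence_feature': [1, 2],
     'seq_of_seqs_feature': [[11.1, 22.2], [33.3, 44.4]]},
    {'label': 32,
     'sequence_feature': [3, 4, 5],
     'seq_of_seqs_feature': [[55.55, 66.66]]},
]

def gen():
    for ex in examples:
        yield ex

# Variable-length dims are declared as None in the element signature.
ds = tf.data.Dataset.from_generator(
    gen,
    output_signature={
        'label': tf.TensorSpec(shape=[], dtype=tf.int32),
        'sequence_feature': tf.TensorSpec(shape=[None], dtype=tf.int32),
        'seq_of_seqs_feature': tf.TensorSpec(shape=[None, None],
                                             dtype=tf.float32),
    })

padded_ds = ds.padded_batch(
    2,
    padded_shapes={
        'label': [],                        # Scalars, no padding.
        'sequence_feature': [None],         # Pad to longest in batch.
        'seq_of_seqs_feature': [None, None],  # Pad both dims to longest.
    })

batch = next(iter(padded_ds))
print(batch['sequence_feature'].numpy())      # [1, 2] padded with a 0
print(batch['seq_of_seqs_feature'].numpy())   # second matrix padded to 2x2
```

Note that in recent TensorFlow 2.x releases, padded_shapes may be omitted entirely, in which case every unknown dimension is padded to the longest element in the batch; passing it explicitly, as here, is still useful when you want per-feature control or fixed target sizes.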