I have tokenized data in the form of a list of unequally shaped arrays:
array([array([1179,    6,  208,    2, 1625,   92,    9, 3870,    3, 2136,  435,
                 5, 2453, 2180,   44,    1,  226,  166,    3, 4409,   49, 6728,
              ...
                10,   17, 1396,  106, 8002, 7968,  111,   33, 1130,   60,  181,
              7988, 7974, 7970])], dtype=object)
With their respective targets:
array([0, 0, 0, ..., 0, 0, 1], dtype=object)
I'm trying to transform them into a padded tf.data.Dataset, but TensorFlow won't let me convert the unequal shapes to a tensor. I get this error:
ValueError: Can't convert non-rectangular Python sequence to Tensor.
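A minimal reproduction, independent of my data (the two short lists are just stand-ins for my ragged sequences):

import tensorflow as tf

# Any non-rectangular (ragged) nested sequence triggers the same error,
# because from_tensor_slices first tries to build one dense tensor.
tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5]])
# ValueError: Can't convert non-rectangular Python sequence to Tensor.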
The full code is here. Assume that my starting point is after y = ...:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

(train_data, test_data) = tfds.load('imdb_reviews/subwords8k',
                                    split=(tfds.Split.TRAIN, tfds.Split.TEST),
                                    as_supervised=True)

x = np.array(list(train_data.as_numpy_iterator()))[:, 0]
y = np.array(list(train_data.as_numpy_iterator()))[:, 1]

# This line raises the ValueError: x.tolist() is a ragged list of sequences.
train_tensor = tf.data.Dataset.from_tensor_slices((x.tolist(), y)) \
    .padded_batch(batch_size=8, padded_shapes=([None], ()))
What are my options to turn this into a padded batch tensor?
If your data is stored in NumPy arrays or Python lists, you can use the tf.data.Dataset.from_generator method to create the dataset and then pad the batches:
train_batches = tf.data.Dataset.from_generator(
    lambda: iter(zip(x, y)),
    output_types=(tf.int64, tf.int64)
).padded_batch(
    batch_size=32,
    padded_shapes=([None], ())
)
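As a quick sanity check (a sketch, assuming x and y are defined as in the question), each batch should now be rectangular, zero-padded up to the longest sequence it contains:

# Inspect one batch: sequences are padded per batch, labels stay scalar.
for sequences, labels in train_batches.take(1):
    print(sequences.shape)  # (32, longest sequence length in this batch)
    print(labels.shape)     # (32,)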
However, if you are using the tensorflow_datasets.load function, there is no need to use as_numpy_iterator to separate the data and the labels and then put them back together in a dataset. That's redundant and inefficient. The objects returned by tensorflow_datasets.load are already instances of tf.data.Dataset, so you just need to call padded_batch on them:
train_batches = train_data.padded_batch(batch_size=32, padded_shapes=([None], []))
test_batches = test_data.padded_batch(batch_size=32, padded_shapes=([None], []))
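These batches can be fed straight into Keras. A minimal sketch with a toy model; the layer sizes are illustrative, and in practice the vocabulary size should come from the dataset's text encoder (e.g. via tfds.load(..., with_info=True)) rather than being hard-coded:

# mask_zero=True makes downstream layers ignore the padded zeros.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=8200,  # illustrative; use the encoder's vocab_size
                              output_dim=16, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_batches, validation_data=test_batches, epochs=1)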
Note that in TensorFlow 2.2 and above, you no longer need to provide the padded_shapes argument if you just want all axes padded to the length of the longest element in the batch (the default behavior):
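# TF 2.2+: no padded_shapes needed; every axis is padded to the
# batch's longest element by default.
train_batches = train_data.padded_batch(batch_size=32)
test_batches = test_data.padded_batch(batch_size=32)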