There are plenty of examples of how to create and use TensorFlow datasets, e.g.
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
My question is how to get the data/labels back from the TF dataset in NumPy form. In other words, what would be the reverse operation of the line above? I have a TF dataset and want to get the images and labels back from it.
If your tf.data.Dataset is batched, the following code will retrieve all the y labels:
y = np.concatenate([y for x, y in ds], axis=0)
Quick explanation: [y for x, y in ds] is a list comprehension in Python. If the dataset is batched, this expression loops through each batch, puts each batch's y (a 1-D TF tensor) in a list, and returns the list. Then np.concatenate takes this list of 1-D tensors (implicitly converting them to NumPy) and joins them along axis 0 to produce a single long vector. In summary, it just converts a bunch of short 1-D vectors into one long vector. Note: if your y is more complex, this answer will need some minor modification.
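To make the batched case concrete, here is a minimal sketch that recovers both images and labels in a single pass. The toy arrays and their shapes are assumptions for illustration only; a single pass is used so that x and y stay aligned even if the dataset is shuffled on each iteration.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for real data (assumed shapes, not from the question).
images = np.arange(10 * 4 * 4, dtype=np.float32).reshape(10, 4, 4)
labels = np.arange(10, dtype=np.int64)

ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(3)

# Collect each batch once, then join the batches along the batch axis.
xs, ys = [], []
for x, y in ds:
    xs.append(x.numpy())
    ys.append(y.numpy())

x_back = np.concatenate(xs, axis=0)
y_back = np.concatenate(ys, axis=0)
```

With an unshuffled dataset like this one, x_back and y_back should match the original arrays element for element.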
Supposing our tf.data.Dataset is called train_dataset, with eager execution on (the default in TF 2.x), you can retrieve images and labels like this:
for images, labels in train_dataset.take(1):  # only take first element of dataset
    numpy_images = images.numpy()
    numpy_labels = labels.numpy()
.numpy() converts a tf.Tensor into a NumPy array.
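As an alternative sketch, tf.data.Dataset also has an as_numpy_iterator() method (available in TF 2.1 and later, as far as I know) that yields NumPy arrays directly, so no explicit .numpy() call is needed. The train_dataset built here is a hypothetical stand-in with made-up shapes.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for train_dataset: 8 tiny "images" with integer labels.
train_dataset = tf.data.Dataset.from_tensor_slices(
    (np.zeros((8, 2, 2), dtype=np.float32), np.arange(8, dtype=np.int64))
).batch(4)

# Each element comes back as a tuple of NumPy arrays, one tuple per batch.
batches = list(train_dataset.as_numpy_iterator())
first_images, first_labels = batches[0]
```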
If you are OK with keeping the images and labels as tf.Tensors, you can do
images, labels = tuple(zip(*dataset))
Think of the dataset as zip(images, labels). To get images and labels back, we can simply unzip it.
If you need the NumPy array version, convert them using np.array():
images = np.array(images)
labels = np.array(labels)
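One caveat with unzipping: on a batched dataset each element is a whole batch, so the result is a tuple of batches rather than of single examples. A rough sketch of one way around that, assuming toy data, is to call unbatch() first and then unzip:

```python
import numpy as np
import tensorflow as tf

# Toy data (assumed shapes) wrapped in a batched dataset.
images = np.random.rand(6, 3, 3).astype(np.float32)
labels = np.arange(6, dtype=np.int64)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)

# unbatch() yields one (image, label) pair per element again,
# and as_numpy_iterator() hands them back as NumPy values.
imgs, lbls = zip(*dataset.unbatch().as_numpy_iterator())

imgs = np.stack(imgs)   # shape (6, 3, 3)
lbls = np.array(lbls)   # shape (6,)
```

Note that this materializes the whole dataset in memory, so it only suits datasets that fit in RAM.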
I think we get a good example here:
https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb#scrollTo=BC4pEXtkp4K-
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
# where mnist_train is a tf.data.Dataset
mnist_train = tfds.load(name="mnist", split=tfds.Split.TRAIN)
assert isinstance(mnist_train, tf.data.Dataset)
mnist_example, = mnist_train.take(1)
image, label = mnist_example["image"], mnist_example["label"]
plt.imshow(image.numpy()[:, :, 0].astype(np.float32), cmap=plt.get_cmap("gray"))
print("Label: %d" % label.numpy())
So each individual component of the dataset can be accessed sort of like a dictionary. Presumably different datasets have different field names (Boston housing won't have image and label, but might have 'features' and 'target' or 'price'):
cnn = tfds.load(name="cnn_dailymail", split=tfds.Split.TRAIN)
assert isinstance(cnn, tf.data.Dataset)
cnn_ex, = cnn.take(1)
print(cnn_ex)
This prints a dict with keys such as 'article' and 'highlights'; the values are string tensors, which .numpy() converts to Python bytes.
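If you don't know the field names in advance, the dataset's element_spec property should tell you. Below is a small sketch using a hand-built dictionary dataset rather than a tfds download; the 'features' and 'target' keys and the values are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical dictionary-structured dataset (keys chosen for illustration).
ds = tf.data.Dataset.from_tensor_slices({
    "features": np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float32),
    "target": np.array([10.0, 20.0], dtype=np.float32),
})

# element_spec mirrors the structure of one element, so its keys
# are the field names available on every example.
field_names = sorted(ds.element_spec.keys())

# Pull the first example as a dict of NumPy values.
example = next(iter(ds.as_numpy_iterator()))
first_features = example["features"]
first_target = example["target"]
```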