A little bit of background: I am loading about 60,000 images into Colab to train a GAN. I have already uploaded them to Drive, and the directory structure contains folders for the different classes (about 7-8) inside root. I am loading them into Colab as follows:
root = "drive/My Drive/data/images"
root = pathlib.Path(root)
list_ds = tf.data.Dataset.list_files(str(root/'*/*'))
for f in list_ds.take(3):
print(f.numpy())
which gives the output:
b'drive/My Drive/data/images/folder_1/2994.jpg'
b'drive/My Drive/data/images/folder_1/6628.jpg'
b'drive/My Drive/data/images/folder_2/37872.jpg'
I am further processing them as follows:
def process_path(file_path):
    label = tf.strings.split(file_path, '/')[-2]
    image = tf.io.read_file(file_path)
    image = tf.image.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image  # , label

ds = list_ds.map(process_path)
BUFFER_SIZE = 60000
BATCH_SIZE = 128
train_dataset = ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
Each image is of size 128x128. Now, coming to the problem: when I try to view a batch in Colab, the execution goes on forever and never stops, for example with this code:
for batch in train_dataset.take(4):
    print([arr.numpy() for arr in batch])
Earlier I thought that batch_size might be an issue, so I tried changing it, but the problem remained. Could it be a problem with Colab, since I am loading a large number of files? Or is it due to the size of the images, given that it was working with MNIST (28x28)? If so, what are the possible solutions?
Thanks in advance.
EDIT: After removing the shuffle statement, the last line executes within a few seconds. So I thought it could be a problem with the BUFFER_SIZE of shuffle, but even with a reduced BUFFER_SIZE it again takes a very long time to execute. Any workaround?
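One workaround worth sketching here (an assumption on my part, not from the original post): shuffle the dataset of file paths before mapping the decode function, so the shuffle buffer holds lightweight strings instead of decoded 128x128 float images.

# Sketch: shuffle the filename strings (cheap) instead of decoded images (expensive).
# Variable names mirror the question's pipeline; adjust to your own setup.
list_ds = tf.data.Dataset.list_files(str(root/'*/*'), shuffle=False)
list_ds = list_ds.shuffle(BUFFER_SIZE)   # buffer only needs to hold file paths
ds = list_ds.map(process_path)           # decode after shuffling
train_dataset = ds.batch(BATCH_SIZE)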
Here is how I load a 1.12GB zipped FLICKR image dataset from my personal Google Drive. First, I unzip the dataset in the Colab environment. Two features that can speed up performance are prefetch and autotune. Additionally, I use the local Colab cache to store the processed images. This takes ~20 seconds to execute the first time (assuming you have unzipped the dataset); the cache then allows subsequent calls to load very quickly.
Assuming you have authorized the Google Drive API, I start by unzipping the folder(s):
!unzip /content/drive/My\ Drive/Flickr8k
!unzip Flickr8k_Dataset
!ls
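If Drive is not yet mounted in the Colab session, the standard mount call looks like this:

from google.colab import drive
drive.mount('/content/drive')  # prompts for authorization, then exposes Drive under /content/drive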
I then used your code with the addition of prefetch(), autotune, and a cache file.
import pathlib
import tensorflow as tf

def prepare_for_training(ds, cache, BUFFER_SIZE, BATCH_SIZE):
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()
    ds = ds.shuffle(buffer_size=BUFFER_SIZE)
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

AUTOTUNE = tf.data.experimental.AUTOTUNE

root = "Flicker8k_Dataset"
root = pathlib.Path(root)

list_ds = tf.data.Dataset.list_files(str(root/'**'))

for f in list_ds.take(3):
    print(f.numpy())

def process_path(file_path):
    label = tf.strings.split(file_path, '/')[-2]
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img)
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image to the desired size.
    img = tf.image.resize(img, [128, 128])
    return img  # , label

ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)

train_dataset = prepare_for_training(ds, cache="./custom_ds.tfcache", BUFFER_SIZE=600000, BATCH_SIZE=128)

for batch in train_dataset.take(4):
    print([arr.numpy() for arr in batch])
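To see the effect of the cache, you can time two full passes over the dataset (a rough sketch; the second pass should read from ./custom_ds.tfcache instead of decoding the JPEGs again):

import time

# First pass decodes images and writes the cache file;
# second pass should be noticeably faster because it reads from the cache.
for run in range(2):
    start = time.perf_counter()
    for _ in train_dataset:
        pass
    print(f"pass {run + 1}: {time.perf_counter() - start:.1f}s")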
Here is a way to do it with Keras flow_from_directory(). The benefit of this approach is that you avoid the TensorFlow shuffle(), which, depending on the buffer size, may require processing the whole dataset. Keras gives you an iterator that you can call to fetch a batch of data, and it has random shuffling built in.
import pathlib
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

root = "Flicker8k_Dataset"
BATCH_SIZE = 128

train_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    directory=root,          # This is the source directory for training images
    target_size=(128, 128),  # All images will be resized
    batch_size=BATCH_SIZE,
    shuffle=True,
    seed=42,                 # for the shuffle
    classes=[''])

i = 4
for batch in range(i):
    [print(x[0]) for x in next(train_generator)]
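For GAN training you typically only need the image batches, not the labels; a small variation of the call above (a sketch reusing the same train_datagen) is to pass class_mode=None so the generator yields plain image arrays:

# Sketch: yield images only (no label tuples), which is what a GAN training loop consumes.
gan_generator = train_datagen.flow_from_directory(
    directory=root,
    target_size=(128, 128),
    batch_size=BATCH_SIZE,
    shuffle=True,
    seed=42,
    class_mode=None,   # images only, no labels
    classes=[''])

real_images = next(gan_generator)   # shape: (BATCH_SIZE, 128, 128, 3)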