
Google Colab not loading image files while using TensorFlow 2.0 batched dataset

A little bit of background: I am loading about 60,000 images into Colab to train a GAN. I have already uploaded them to Drive, and the directory structure contains folders for the different classes (about 7-8) under the root. I am loading them in Colab as follows:

root = "drive/My Drive/data/images"
root = pathlib.Path(root)

list_ds = tf.data.Dataset.list_files(str(root/'*/*'))

for f in list_ds.take(3):
  print(f.numpy())

which gives the output:

b'drive/My Drive/data/images/folder_1/2994.jpg'
b'drive/My Drive/data/images/folder_1/6628.jpg'
b'drive/My Drive/data/images/folder_2/37872.jpg'
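
For reference, these relative paths assume Drive is already mounted into the Colab runtime and the working directory is the default /content; a minimal sketch of the standard mount call:

from google.colab import drive
drive.mount('/content/drive')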

I am further processing them as follows:

def process_path(file_path):
  label = tf.strings.split(file_path, '/')[-2]
  image = tf.io.read_file(file_path)
  image = tf.image.decode_jpeg(image)
  image = tf.image.convert_image_dtype(image, tf.float32)
  return image#, label

ds = list_ds.map(process_path)

BUFFER_SIZE = 60000
BATCH_SIZE = 128

train_dataset = ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

Each image is of size 128x128. Now, coming to the problem: when I try to view a batch in Colab, the execution goes on forever and never finishes. For example, with this code:

for batch in train_dataset.take(4):
  print([arr.numpy() for arr in batch])

Earlier I thought that batch_size might be the issue, so I tried changing it, but the problem persists. Could it be a problem with Colab itself, since I am loading a large number of files?

Or is it due to the size of the images, since it was working with MNIST (28x28)? If so, what are the possible solutions?

Thanks in advance.

EDIT: After removing the shuffle statement, the last line executes within a few seconds. So I thought it could be a problem with the BUFFER_SIZE of shuffle, but even with a reduced BUFFER_SIZE it again takes a very long time to execute. Any workaround?
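
One workaround I am considering (untested sketch below): with images decoded to 128x128x3 float32, a shuffle buffer of 60,000 would need roughly 60,000 x 128 x 128 x 3 x 4 bytes ≈ 11 GB, so shuffling the file paths (plain strings) before the map() should keep the buffer small:

# Untested sketch: shuffle the lightweight path strings instead of decoded images.
list_ds = tf.data.Dataset.list_files(str(root/'*/*'), shuffle=False)
list_ds = list_ds.shuffle(BUFFER_SIZE, reshuffle_each_iteration=True)
ds = list_ds.map(process_path)
train_dataset = ds.batch(BATCH_SIZE)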

asked Mar 27 '20 by bkshi


1 Answer

Here is how I load a 1.12 GB zipped Flickr image dataset from my personal Google Drive. First, I unzip the dataset in the Colab environment. Two features that can speed up performance are prefetch and autotune. Additionally, I use a local Colab cache file to store the processed images. This takes ~20 seconds to execute the first time (assuming you have unzipped the dataset); the cache then allows subsequent calls to load very fast.

Assuming you have authorized the Google Drive API, I start by unzipping the folder(s):

!unzip /content/drive/My\ Drive/Flickr8k
!unzip Flickr8k_Dataset
!ls

I then used your code with the addition of prefetch(), AUTOTUNE, and a cache file.

import pathlib
import tensorflow as tf

def prepare_for_training(ds, cache, BUFFER_SIZE, BATCH_SIZE):
  # Cache either to a file (if a path string is given) or in memory.
  if cache:
    if isinstance(cache, str):
      ds = ds.cache(cache)
    else:
      ds = ds.cache()
  ds = ds.shuffle(buffer_size=BUFFER_SIZE)
  ds = ds.batch(BATCH_SIZE)
  # Let tf.data prepare the next batches while the current one is consumed.
  ds = ds.prefetch(buffer_size=AUTOTUNE)
  return ds

AUTOTUNE = tf.data.experimental.AUTOTUNE

root = "Flicker8k_Dataset"
root = pathlib.Path(root)

list_ds = tf.data.Dataset.list_files(str(root/'**'))

for f in list_ds.take(3):
  print(f.numpy())

def process_path(file_path):
  label = tf.strings.split(file_path, '/')[-2]
  img = tf.io.read_file(file_path)
  img = tf.image.decode_jpeg(img)
  img = tf.image.convert_image_dtype(img, tf.float32)
  # resize the image to the desired size.
  img =  tf.image.resize(img, [128, 128])
  return img#, label

ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)
train_dataset = prepare_for_training(ds, cache="./custom_ds.tfcache", BUFFER_SIZE=600000, BATCH_SIZE=128)
for batch in train_dataset.take(4):
  print([arr.numpy() for arr in batch])
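
One caveat with the file-based cache ("./custom_ds.tfcache"): tf.data writes the mapped elements to that file on the first pass and replays them on later runs, so if you change process_path (for example the resize size) you should delete the cache files first, e.g.:

!rm -f ./custom_ds.tfcache*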

Here is a way to do it with Keras flow_from_directory(). The benefit of this approach is that you avoid the TensorFlow shuffle(), which, depending on the buffer size, may require processing the whole dataset. Keras gives you an iterator which you can call to fetch the data batch, and it has random shuffling built in.

import pathlib
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

root = "Flicker8k_Dataset"
BATCH_SIZE=128

train_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        directory = root,  # This is the source directory for training images
        target_size=(128, 128),  # All images will be resized
        batch_size=BATCH_SIZE,
        shuffle=True,
        seed=42, #for the shuffle
        classes=[''])

i = 4
for batch in range(i):
  [print(x[0]) for x in next(train_generator)]
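
A note on the classes=[''] argument: it makes flow_from_directory treat root itself as the single class directory, which suits a flat folder of images like this Flickr set. If your images sit in per-class subfolders and you want the folder names as labels, omit classes and flow_from_directory will infer one class per subfolder.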
answered Oct 24 '22 by pastaleg