I have an image processing problem with five classes, where each class has approximately 10 million training examples and each example is a z-scored 25x25 numpy array.
Obviously, I can't load all the training data into memory, so I have to use fit_generator.
I am also the one who generates and augments these training matrices, but I can't do it in real time inside fit_generator because that would make training too slow.
First, how should I store 50 million 25x25 .npy arrays on disk? What would be the best practice?
Second, should I use a database to store these matrices and query it during training? I don't think SQLite supports multiple threads, and SQL dataset support is still experimental in TensorFlow.
I would love to know if there is a neat way to store these 50 million matrices so that retrieval during training is optimal.
Third, what about using the HDF5 format? Should I switch to PyTorch instead?
How to store np.array objects on disk?
Storing them in an HDF5 file is a good idea. The basic HDF5 type is a Dataset, which contains multidimensional arrays of a homogeneous type. HDF5 Datasets can be assembled into HDF5 Groups, which can in turn contain other groups, to create more complex structures. Another way is to pickle your numpy arrays or more abstract dataset objects directly to disk, but then your files would be readable by Python only; pickling is also discouraged for security reasons. Finally, if you want to optimize your data format for TensorFlow read/write operations, you can use the TFRecord file format. Saving your numpy arrays in the TFRecord format can be tricky, but thankfully someone created a script to do that.
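As a minimal sketch of the HDF5 route, assuming the h5py package: the file name images.h5, the per-class group layout, and the chunk size are illustrative choices, not requirements.

import numpy as np
import h5py

# One resizable, chunked dataset per class; chunking lets HDF5 read
# contiguous blocks of images instead of touching the whole file.
with h5py.File("images.h5", "w") as f:
    dset = f.create_dataset(
        "class_0/images",
        shape=(0, 25, 25),
        maxshape=(None, 25, 25),   # unlimited along the sample axis
        dtype="float32",
        chunks=(1024, 25, 25),     # chunk size to tune for your disk
    )
    # Append the generated/augmented matrices batch by batch.
    batch = np.random.rand(4096, 25, 25).astype("float32")  # stand-in data
    dset.resize(dset.shape[0] + len(batch), axis=0)
    dset[-len(batch):] = batch

# During training you read only the slices you need, never the whole file.
with h5py.File("images.h5", "r") as f:
    first_batch = f["class_0/images"][:256]

Keeping one dataset per class makes it easy to sample balanced batches, but a single dataset plus a separate label array works just as well.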
Should I use a database to store these matrices and query them during training?
You could, but then you would be reinventing the wheel. What you need is one or more processes running in parallel with your training process, reading the next batch of training observations (prefetching) and applying some transformations to them while the training process is working on the previous batch. This way you avoid most of the I/O and preprocessing delay and can get significant performance gains. AI frameworks have developed their own tools for this problem. In PyTorch, there is the class torch.utils.data.DataLoader. Here is a tutorial that shows how to efficiently load HDF5 files using a DataLoader.
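For instance, here is a hedged sketch of that pattern reading from an HDF5 file with h5py; the class name H5ImageDataset, the file images.h5, and the key class_0/images are illustrative, and the file is opened lazily so that each worker process gets its own HDF5 handle.

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5ImageDataset(Dataset):
    def __init__(self, path="images.h5", key="class_0/images"):
        self.path, self.key = path, key
        with h5py.File(path, "r") as f:
            self.length = f[key].shape[0]
        self._file = None  # opened lazily, once per worker process

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        image = self._file[self.key][idx]            # read one 25x25 slice
        return torch.from_numpy(image).unsqueeze(0)  # add a channel dimension

# num_workers > 0 reads the batches in separate processes,
# in parallel with the training loop that consumes them.
loader = DataLoader(H5ImageDataset(), batch_size=256,
                    shuffle=True, num_workers=4)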
In TensorFlow, you can create an input pipeline using the class tf.data.Dataset. A basic approach is to first open the file(s) (1), read the data from the file(s) into memory (2), then train your model on what is in memory (3). Let's mock a TF Dataset and training loop:
import time

import tensorflow as tf

class MyDataset(tf.data.Dataset):
    def __new__(cls, filename="image_dataset.proto"):
        time.sleep(0.01)  # mock step (1) delay
        return tf.data.TFRecordDataset([filename])

def train(dataset, nb_epoch=10):
    start_time = time.perf_counter()
    for epoch_num in range(nb_epoch):
        for sample in dataset:  # where step (2) delay takes place
            time.sleep(0.01)  # mock step (3) delay
    tf.print("Execution time:", time.perf_counter() - start_time)
You can just apply steps (1, 2, 3) sequentially:
train(MyDataset())
A better way is to read the next batch of data while the training process is still working on the previous batch, so that steps (2, 3) overlap. Applying transformations to the next batch while still training on the previous one is also possible (a sketch follows the prefetch example below). To prefetch:
train(MyDataset().prefetch(tf.data.experimental.AUTOTUNE))
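To also overlap the transformations, you can map them over the dataset with parallel calls; a sketch, where augment is a placeholder for whatever per-sample preprocessing or augmentation you would otherwise do ahead of time.

def augment(record):
    # placeholder: parse the record, z-score, random flips, etc.
    return record

train(MyDataset()
      .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .prefetch(tf.data.experimental.AUTOTUNE))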
Additionally, you can parallelize the reads themselves, so that several sequences of steps (1, 2) run in parallel:
train(tf.data.Dataset.range(2)
      .interleave(lambda _: MyDataset(),
                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .prefetch(tf.data.experimental.AUTOTUNE))
Learn more in the documentation.
Should I switch to PyTorch instead?
Almost everything that PyTorch can do, TensorFlow can do too. TensorFlow has been the most production-ready AI framework for a while, used by Google on their TPUs, although PyTorch is catching up. I would say that PyTorch is more research/development oriented, while TensorFlow is more production oriented. Another difference is how you design your neural networks: PyTorch builds the model dynamically as you stack layers on top of each other, while in TensorFlow you traditionally first define a computational graph and then run it on your input data. People often develop their models in PyTorch and then export them to a TensorFlow-compatible format for production.