I have an image processing problem with five classes, where each class has approximately 10 million training examples and each example is a z-scored 25x25 numpy array.
Obviously, I can't load all the training data into memory, so I have to use fit_generator.
I am also the one who generates and augments these training matrices, but I can't do it in real time inside fit_generator because that would make training too slow.
First, how should I store 50 million 25x25 .npy arrays on disk? What would be the best practice?
Second, should I use a database to store these matrices and query it during training? I don't think SQLite supports multiple threads, and SQL dataset support is still experimental in TensorFlow.
I would love to know if there is a neat way to store these 50 million matrices so that retrieval during training is optimal.
Third, what about using the HDF5 format? Should I switch to PyTorch instead?
How to store np.array objects on disk?
Storing them in an HDF5 file is a good idea. The basic HDF5 type is a Dataset, which contains multidimensional arrays of a homogeneous type. HDF5 Datasets can be assembled into HDF5 Groups, which can in turn contain other groups, to create more complex structures. Another way is to pickle your numpy arrays or more abstract dataset objects directly to disk, but then your files would be readable by Python only; pickling is also discouraged for security reasons. Finally, if you want to optimize your data format for TensorFlow read/write operations, you can use the TFRecord file format. Saving your numpy arrays in the TFRecord format can be tricky, but thankfully someone created a script to do that.
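As a minimal sketch of the HDF5 route, assuming the h5py package: the file name images.h5, the per-class group layout, and the chunk size are illustrative choices, not requirements.

import numpy as np
import h5py

# One resizable, chunked dataset per class; chunking lets HDF5 read
# contiguous blocks of images instead of touching the whole file.
with h5py.File("images.h5", "w") as f:
    dset = f.create_dataset(
        "class_0/images",
        shape=(0, 25, 25),
        maxshape=(None, 25, 25),   # unlimited along the sample axis
        dtype="float32",
        chunks=(1024, 25, 25),     # chunk size to tune for your disk
    )
    # Append the generated/augmented matrices batch by batch.
    batch = np.random.rand(4096, 25, 25).astype("float32")  # stand-in data
    dset.resize(dset.shape[0] + len(batch), axis=0)
    dset[-len(batch):] = batch

# During training you read only the slices you need, never the whole file.
with h5py.File("images.h5", "r") as f:
    first_batch = f["class_0/images"][:256]

Keeping one dataset per class makes it easy to sample balanced batches, but a single dataset plus a separate label array works just as well.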
Should I use a database to store these matrices and query them during training?
You could, but then you would be reinventing the wheel. What you need is one or more processes running in parallel with your training process, reading the next batch of training observations (prefetching) and applying some transformations to them while the training process is working on the previous batch. This way you avoid most of the I/O and preprocessing delay and can get significant performance gains. AI frameworks have developed their own tools for this problem. In PyTorch, there is the class torch.utils.data.DataLoader. Here is a tutorial that shows how to efficiently load HDF5 files using a DataLoader.
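For instance, here is a hedged sketch of that pattern reading from an HDF5 file with h5py; the class name H5ImageDataset, the file images.h5, and the key class_0/images are illustrative, and the file is opened lazily so that each worker process gets its own HDF5 handle.

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5ImageDataset(Dataset):
    def __init__(self, path="images.h5", key="class_0/images"):
        self.path, self.key = path, key
        with h5py.File(path, "r") as f:
            self.length = f[key].shape[0]
        self._file = None  # opened lazily, once per worker process

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        image = self._file[self.key][idx]            # read one 25x25 slice
        return torch.from_numpy(image).unsqueeze(0)  # add a channel dimension

# num_workers > 0 reads the batches in separate processes,
# in parallel with the training loop that consumes them.
loader = DataLoader(H5ImageDataset(), batch_size=256,
                    shuffle=True, num_workers=4)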
In TensorFlow, you can create an input pipeline using the class tf.data.Dataset. A basic approach is to first open the file(s) (1), read the data from the file(s) into memory (2), then train your model on what is in memory (3). Let's mock a TF Dataset and training loop:
import time

import tensorflow as tf

class MyDataset(tf.data.Dataset):
    def __new__(cls, filename="image_dataset.proto"):
        time.sleep(0.01)  # mock step (1) delay
        return tf.data.TFRecordDataset([filename])

def train(dataset, nb_epoch=10):
    start_time = time.perf_counter()
    for epoch_num in range(nb_epoch):
        for sample in dataset:  # where step (2) delay takes place
            time.sleep(0.01)  # mock step (3) delay
    tf.print("Execution time:", time.perf_counter() - start_time)
You can just apply steps (1, 2, 3) sequentially:
train(MyDataset())
A better way is to read the next batch of data while the training process is still working on the previous batch, so that steps (2, 3) overlap. Applying transformations to the next batch while still training on the previous one is also possible (a sketch follows the prefetch example below). To prefetch:
train(MyDataset().prefetch(tf.data.experimental.AUTOTUNE))
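To also overlap the transformations, you can map them over the dataset with parallel calls; a sketch, where augment is a placeholder for whatever per-sample preprocessing or augmentation you would otherwise do ahead of time.

def augment(record):
    # placeholder: parse the record, z-score, random flips, etc.
    return record

train(MyDataset()
      .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .prefetch(tf.data.experimental.AUTOTUNE))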
Additionally, you can parallelize the reads themselves, so that several sequences of steps (1, 2) run in parallel:
train(tf.data.Dataset.range(2)
      .interleave(lambda _: MyDataset(),
                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .prefetch(tf.data.experimental.AUTOTUNE))
Learn more in the documentation.
Should I switch to PyTorch instead?
Almost everything that PyTorch can do, TensorFlow can do too. TensorFlow has been the most production-ready AI framework for a while, used by Google on their TPUs, although PyTorch is catching up. I would say that PyTorch is more research/development oriented, while TensorFlow is more production oriented. Another difference is how you design your neural networks: PyTorch builds the model dynamically as you stack layers on top of each other, while in TensorFlow you traditionally first define a computational graph and then run it on your input data. People often develop their models in PyTorch and then export them to a TensorFlow-compatible format for production.