This question concerns the common problem of training in Keras on multiple large files which together are too large to fit in GPU memory. I am using Keras 1.0.5 and would like a solution that does not require 1.0.6. One way to do this was described by fchollet here and here:
import pickle

# Generator that yields (features X, labels y) for one file at a time
def BatchGenerator(files):
    for file in files:
        # Open the path held in the loop variable, not the literal string "file"
        with open(file, "rb") as f:
            current_data = pickle.load(f)
        X_train = current_data[:, :-1]  # all columns except the last
        y_train = current_data[:, -1]   # last column holds the labels
        yield (X_train, y_train)

# Train the model on each dataset in turn
for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size=32, nb_epoch=1)
However, I fear that the state of the model is not saved, and that the model is instead reinitialized not only between epochs but also between datasets. In the output below, each "Epoch 1/1" represents training on a different dataset, and the loss for each dataset is identical in both epochs:
~~~~~ Epoch 0 ~~~~~~
Epoch 1/1
295806/295806 [==============================] - 13s - loss: 15.7517
Epoch 1/1
407890/407890 [==============================] - 19s - loss: 15.8036
Epoch 1/1
383188/383188 [==============================] - 19s - loss: 15.8130
~~~~~ Epoch 1 ~~~~~~
Epoch 1/1
295806/295806 [==============================] - 14s - loss: 15.7517
Epoch 1/1
407890/407890 [==============================] - 20s - loss: 15.8036
Epoch 1/1
383188/383188 [==============================] - 15s - loss: 15.8130
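One way to test this suspicion directly is to snapshot the weights before and after each fit call and compare them. The helpers below are a minimal sketch, not part of Keras: the names flatten and max_abs_change are made up for illustration, and the commented usage assumes that model.get_weights() returns a list of NumPy arrays.

```python
def flatten(nested):
    """Yield every number in an arbitrarily nested list or tuple."""
    for item in nested:
        if isinstance(item, (list, tuple)):
            for value in flatten(item):
                yield value
        else:
            yield item


def max_abs_change(before, after):
    """Largest absolute difference between two weight snapshots of equal shape."""
    return max(abs(a - b) for a, b in zip(flatten(before), flatten(after)))


# Hypothetical usage with a compiled Keras model (not executed here):
#   before = [w.tolist() for w in model.get_weights()]
#   model.fit(X_train, y_train, batch_size=32, nb_epoch=1)
#   after = [w.tolist() for w in model.get_weights()]
#   print(max_abs_change(before, after))
```

If the snapshot taken before one fit call matches the snapshot taken after the previous one, the weights were carried over rather than reset. A change of zero across a fit call would instead suggest that the model is not updating at all.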
I am aware that one can use model.fit_generator, but since the method above was repeatedly suggested as a way of batch training, I would like to know what I am doing wrong.
Thanks for your help,
Max
Note: As our dataset is too large to fit in memory, we have to load it from the hard disk into memory in batches. To do so, we are going to create a custom generator.
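A custom generator of this kind could look roughly like the sketch below. It uses only the standard library and assumes that each file holds a pickled list of rows with the label in the last position; the name batch_generator and the demo files are invented for illustration.

```python
import os
import pickle
import tempfile


def batch_generator(file_paths, batch_size):
    """Yield (features, labels) batches while holding only one file in memory."""
    for path in file_paths:
        with open(path, "rb") as handle:
            rows = pickle.load(handle)  # assumed layout: list of rows, label last
        for start in range(0, len(rows), batch_size):
            chunk = rows[start:start + batch_size]
            features = [row[:-1] for row in chunk]
            labels = [row[-1] for row in chunk]
            yield features, labels


# Demo: write two small pickled files to a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for index in range(2):
    rows = [[i, i + 1, i % 2] for i in range(5)]
    path = os.path.join(tmp_dir, "part_%d.pkl" % index)
    with open(path, "wb") as handle:
        pickle.dump(rows, handle)
    paths.append(path)

batches = list(batch_generator(paths, batch_size=2))
```

Each file is released as soon as the loop moves on to the next one, so peak memory is bounded by the largest single file rather than by the whole dataset.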
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning).
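The idea can be illustrated without any deep learning library. The following is a toy sketch under stated assumptions: stream_chunks stands in for reading chunks from disk, and OnlineLinearModel with its partial_fit method is a made-up class (the method name is borrowed from a common convention, not from Keras). The point is that the parameters persist between calls, so the model keeps improving as chunks arrive.

```python
import random


def stream_chunks(n_chunks, chunk_size, seed=0):
    """Stand-in for loading chunks from disk: yields (xs, ys) with y = 3x + 1."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(chunk_size)]
        ys = [3.0 * x + 1.0 for x in xs]
        yield xs, ys


class OnlineLinearModel(object):
    """One-dimensional linear model trained by stochastic gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def partial_fit(self, xs, ys):
        """Run one SGD pass over a chunk; w and b carry over between calls."""
        for x, y in zip(xs, ys):
            error = (self.w * x + self.b) - y
            self.w -= self.learning_rate * error * x
            self.b -= self.learning_rate * error


model = OnlineLinearModel()
for xs, ys in stream_chunks(n_chunks=20, chunk_size=100):
    model.partial_fit(xs, ys)  # only one chunk is in memory at a time
```

After all chunks have been seen, model.w and model.b should be close to the true slope and intercept used to generate the data.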
It has been a while since I faced that problem, but I remember that I used Keras' support for feeding data through Python generators, i.e. model = Sequential(); model.fit_generator(...).
An example code snippet (should be self-explanatory):
import pickle
import numpy as np
from keras.models import Sequential

def generate_batches(files, batch_size):
    counter = 0
    while True:  # loop forever; fit_generator decides when an epoch ends
        fname = files[counter]
        print(fname)
        counter = (counter + 1) % len(files)  # wrap around to the first file
        with open(fname, "rb") as f:
            data_bundle = pickle.load(f)
        X_train = data_bundle[0].astype(np.float32)
        y_train = data_bundle[1].astype(np.float32)
        y_train = y_train.flatten()
        # Slice the loaded bundle into mini-batches
        for cbatch in range(0, X_train.shape[0], batch_size):
            yield (X_train[cbatch:(cbatch + batch_size), :, :], y_train[cbatch:(cbatch + batch_size)])

model = Sequential()
# (layer definitions are omitted in this snippet)
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
train_files = [train_bundle_loc + "bundle_" + str(cb) for cb in range(nb_train_bundles)]
gen = generate_batches(files=train_files, batch_size=batch_size)
history = model.fit_generator(gen, samples_per_epoch=samples_per_epoch, nb_epoch=num_epoch, verbose=1, class_weight=class_weights)
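In Keras 1.x, fit_generator ends an epoch once samples_per_epoch samples have been drawn from the generator, so setting it to the total number of rows across all bundles makes one epoch correspond to one pass over the files. A possible way to compute that total is sketched below; count_samples is a hypothetical helper, and it assumes each bundle is a pickled (features, labels) pair as in the snippet above. The demo bundles are invented stand-ins.

```python
import os
import pickle
import tempfile


def count_samples(file_paths):
    """Sum the number of rows over all bundles, loading one file at a time."""
    total = 0
    for path in file_paths:
        with open(path, "rb") as handle:
            bundle = pickle.load(handle)  # assumed layout: (features, labels)
        total += len(bundle[0])
    return total


# Demo with two tiny made-up bundles written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_files = []
for index, n_rows in enumerate([3, 5]):
    features = [[float(i), float(i) * 2.0] for i in range(n_rows)]
    labels = [i % 2 for i in range(n_rows)]
    path = os.path.join(demo_dir, "bundle_%d" % index)
    with open(path, "wb") as handle:
        pickle.dump((features, labels), handle)
    demo_files.append(path)

samples_per_epoch = count_samples(demo_files)
```

Counting costs one extra pass over the files, so for large datasets it may be worth storing the result instead of recomputing it on every run.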