I have a huge dataset that I need to provide to Keras in the form of a generator because it does not fit into memory. However, using fit_generator, I cannot replicate the results I get during usual training with model.fit. Also, each epoch lasts considerably longer.
I implemented a minimal example. Maybe someone can show me where the problem is.
import random
import numpy
from keras.layers import Dense
from keras.models import Sequential
random.seed(23465298)
numpy.random.seed(23465298)
no_features = 5
no_examples = 1000
def get_model():
    network = Sequential()
    network.add(Dense(8, input_dim=no_features, activation='relu'))
    network.add(Dense(1, activation='sigmoid'))
    network.compile(loss='binary_crossentropy', optimizer='adam')
    return network

def get_data():
    example_input = [[float(f_i == e_i % no_features) for f_i in range(no_features)] for e_i in range(no_examples)]
    example_target = [[float(t_i % 2)] for t_i in range(no_examples)]
    return example_input, example_target

def data_gen(all_inputs, all_targets, batch_size=10):
    input_batch = numpy.zeros((batch_size, no_features))
    target_batch = numpy.zeros((batch_size, 1))
    while True:
        for example_index, each_example in enumerate(zip(all_inputs, all_targets)):
            each_input, each_target = each_example
            wrapped = example_index % batch_size
            input_batch[wrapped] = each_input
            target_batch[wrapped] = each_target
            if wrapped == batch_size - 1:
                yield input_batch, target_batch

if __name__ == "__main__":
    input_data, target_data = get_data()
    g = data_gen(input_data, target_data, batch_size=10)
    model = get_model()
    model.fit(input_data, target_data, epochs=15, batch_size=10)  # 15 * (1000 / 10) * 10
    # model.fit_generator(g, no_examples // 10, epochs=15)        # 15 * (1000 / 10) * 10
On my computer, model.fit always finishes the 10th epoch with a loss of 0.6939, after roughly 2-3 seconds. The method model.fit_generator, however, runs considerably longer and finishes the last epoch with a different loss (0.6931).
In general, I don't understand why the results of the two approaches differ. The difference might not look like much, but I need to be sure that the same data with the same net produce the same result, regardless of whether I use conventional training or the generator.
Update: @Alex R. provided an answer for part of the original problem (some of the performance issue as well as changing results with each run). As the core problem remains, however, I merely adjusted the question and title accordingly.
With the fit method you pass your whole dataset at once; use it when you can load all of your data into memory (a small dataset). With fit_generator(), you don't pass x and y directly; instead, the batches come from a generator.
fit is used when the entire training dataset fits into memory and no data augmentation is applied. fit_generator is used when either the dataset is too huge to fit into memory or data augmentation needs to be applied.
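To make the contrast concrete, here is a minimal sketch of the two calling conventions. The names model, x, y, and batch_gen are placeholders (a compiled Keras model, in-memory numpy arrays, and a generator that yields (x_batch, y_batch) tuples indefinitely), not code from the question:

# fit: the whole dataset sits in memory and Keras slices out batches itself
model.fit(x, y, batch_size=10, epochs=15)

# fit_generator: batches come from the generator; steps_per_epoch tells Keras
# how many generator calls make up one epoch
model.fit_generator(batch_gen, steps_per_epoch=len(x) // 10, epochs=15)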
After you have created and configured your ImageDataGenerator, you must fit it on your data. This will calculate any statistics required to actually perform the transforms on your image data. You do this by calling the fit() function on the data generator and passing it your training dataset.
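For example (a sketch; x_train, y_train, and model are placeholder names for image arrays and a compiled model, not from the question):

from keras.preprocessing.image import ImageDataGenerator

# featurewise statistics must be computed before batches can be normalized
datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True)
datagen.fit(x_train)  # computes the dataset-wide mean and std

# flow() then yields already-normalized batches for fit_generator to consume
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    steps_per_epoch=len(x_train) // 32,
                    epochs=15)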
Batch sizes
In fit, you're using the standard batch size of 32. In fit_generator, you're using a batch size of 10. Keras probably runs a weight update after each batch, so if you're using batches of different sizes, you get different gradients between the two methods. And once there is a different weight update, both models will never meet again.
Try to use fit with batch_size=10, or use a generator with batch_size=32, as in the sketch below.
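A sketch using the question's own names (model, input_data, target_data, g), making both paths perform the same number of same-sized weight updates per epoch:

batch_size = 10
steps_per_epoch = no_examples // batch_size  # 100 weight updates per epoch

# either train from memory with that batch size...
model.fit(input_data, target_data, epochs=15, batch_size=batch_size)
# ...or drive the generator for the same number of steps per epoch
model.fit_generator(g, steps_per_epoch=steps_per_epoch, epochs=15)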
Seed problem?
Are you creating a new model with get_model() for each case?
If so, the weights of the two models are different, and naturally you will get different results for the two models. (OK, you've set a seed, but if you're using TensorFlow, maybe you're facing this issue.)
In the long run they will sort of converge, though. The difference between the two doesn't seem that big.
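One way to take the initialization out of the equation (a sketch, assuming the TF1-era Keras backend this question uses) is to seed TensorFlow in addition to random and numpy, or to reuse the exact same initial weights for both runs:

import tensorflow as tf

tf.set_random_seed(23465298)  # TF1 API; in TF2 this is tf.random.set_seed

# or: snapshot the initial weights and restore them before the second run
model = get_model()
initial_weights = model.get_weights()
model.fit(input_data, target_data, epochs=15, batch_size=10)

model.set_weights(initial_weights)  # both runs now start from identical weights
model.fit_generator(g, steps_per_epoch=no_examples // 10, epochs=15)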
Checking data
If you are not sure that your generator yields the data you expect, do a simple loop on it and print/compare/check what it yields:

for i in range(numberOfBatches):
    x, y = next(g)  # or g.next() in Python 2
    # print or compare x and y here
Make sure to shuffle your batches within your generator, for example as sketched below.
This discussion suggests turning on shuffle in your iterator: https://github.com/keras-team/keras/issues/2389. I had the same problem, and this resolved it.
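A sketch of one way to do that (my own variant of the question's data_gen, not code from the thread), reshuffling the example order at the start of every epoch:

def shuffling_data_gen(all_inputs, all_targets, batch_size=10):
    all_inputs = numpy.asarray(all_inputs)
    all_targets = numpy.asarray(all_targets)
    n = len(all_inputs)
    while True:
        order = numpy.random.permutation(n)  # new order every epoch
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            # yield copies so Keras never sees a buffer that is mutated later
            yield all_inputs[idx].copy(), all_targets[idx].copy()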