I've experienced this with custom made modules as well, but for this example I'm specifically using one of the official PyTorch examples and the MNIST dataset.
I've ported the exact architecture to Keras on TF2 with eager mode, like so:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1), activation='relu'),
    keras.layers.Conv2D(64, (3, 3)),
    keras.layers.MaxPool2D((2, 2)),
    keras.layers.Dropout(0.25),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation='softmax'),
])
model.summary()

model.compile(optimizer=keras.optimizers.Adadelta(),
              loss=keras.losses.sparse_categorical_crossentropy,
              metrics=['accuracy'])
model.fit(train_data, train_labels, batch_size=64, epochs=30, shuffle=True, max_queue_size=1)
The training loop in PyTorch is:
import torch.nn.functional as F

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the batch from host (CPU) memory to the GPU
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
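For reference, the data pipeline follows the official MNIST example; roughly the following sketch (the normalization constants and dataset path are taken from that example and are assumptions about my exact run):

# Sketch of the data loading from the official MNIST example.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_dataset = datasets.MNIST('../data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)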
I time every epoch like so:
import time

for epoch in range(1, args.epochs + 1):
    since = time.time()
    train(args, model, device, train_loader, optimizer, epoch)
    # test(args, model, device, test_loader)
    # scheduler.step()
    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
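Since CUDA kernels launch asynchronously, a wall-clock measurement can in principle miss work still queued on the GPU. A sketch of a synchronized variant of the same timing loop, assuming a CUDA device:

# Sketch: synchronize before reading the clock so pending GPU work is
# included in the per-epoch measurement (only meaningful on a GPU).
import time
import torch

for epoch in range(1, args.epochs + 1):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    since = time.time()
    train(args, model, device, train_loader, optimizer, epoch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    time_elapsed = time.time() - since
    print('Epoch {} took {:.0f}m {:.0f}s'.format(epoch, time_elapsed // 60, time_elapsed % 60))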
I have verified that the Keras version runs at around 4-5 seconds per epoch, while the PyTorch version runs at around 9-10 seconds per epoch.
Why is this and how can I improve this time?
I think there is a subtle difference that must be taken into consideration. My best hunch is that it is not the GPU processing time itself, but the max_queue_size parameter, which defaults to 10 in Keras.
Since the plain for-loop in PyTorch does not queue batches by default, the internal queue that Keras benefits from lets batches be staged for the GPU ahead of time. In essence, much less time is spent feeding the GPU: it consumes batches from that queue faster, and the overhead of transferring data from CPU to GPU is reduced.
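If the bottleneck really is feeding the GPU, the usual way to get a similar prefetching effect in PyTorch is to use DataLoader worker processes, pinned memory, and non-blocking transfers. A sketch of what that could look like (num_workers=4 is an arbitrary example value, not a tuned recommendation):

# Sketch: overlap data preparation with GPU compute, similar in spirit
# to Keras's internal queue.
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,      # prepare batches in background worker processes
    pin_memory=True,    # page-locked host memory speeds up CPU-to-GPU copies
)

# Inside the training loop, make the host-to-device copy non-blocking so it
# can overlap with compute (only effective together with pin_memory=True):
data = data.to(device, non_blocking=True)
target = target.to(device, non_blocking=True)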
Apart from the observation above, I cannot see any other visible difference; maybe other people can point out new findings.