I'm doing image classification with cudaconvnet, using Daniel Nouri's noccn module, and I want to implement data augmentation by taking lots of patches of each original image (and flipping them). When would it be best for this to take place?
I've identified 3 stages in the training process where it could happen:
a) when creating batches from the data
b) when getting the next batch to train
c) given a batch, when getting the next image to feed into the net
It seems to me that the advantage of a) is that I can scatter the augmented data across all the batches. But it would take up 1000x more space on disk; the original dataset is already 1TB, so that's completely infeasible.
b) and c) don't involve storing the new data on disk, but could I still scatter the data across batches? If I don't, then supposing batch_size==128 and 1000x augmentation, the next ~8 batches (1000/128) will all contain images derived from the same original, i.e. from the same class. Isn't that bad for training the net, because the training samples won't be randomised at all?
Furthermore, if I pick b) or c) and create a new batch from k training examples, then augmenting the data n times will make the batch size n*k instead of giving me n times more batches.
For example, in my case I have batch_size==128 and can expect 1000x data augmentation. So each batch would actually be of size 128*1000 = 128,000, and all I'd gain is more accurate gradient estimates (and that to a useless extent, because a batch size of 128k is pointlessly high).
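(For concreteness, here is the back-of-the-envelope arithmetic behind those numbers in a throwaway Python snippet, using only the figures already quoted above:)

```python
dataset_tb = 1.0        # original dataset size on disk
augment_factor = 1000   # patches + flips per original image
batch_size = 128

print(dataset_tb * augment_factor)    # a): ~1000 TB of augmented data on disk
print(augment_factor / batch_size)    # b)/c) without interleaving: ~7.8 same-class batches in a row
print(batch_size * augment_factor)    # b)/c) augmenting within the batch: a 128,000-sample "batch"
```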
So what should I do?
Right, you'd want the augmented samples interspersed as randomly as possible throughout the rest of the data. Otherwise, you'll definitely run into the problems you mention: the batches won't be properly sampled and your gradient descent steps will be biased. I'm not too familiar with cudaconvnet, since I primarily work with Torch, but I often run into the same situation with artificially augmented data.
Your best bet would be (c), kind of.
For me, the best place to augment the data is right when a sample gets loaded by your trainer's inner loop -- apply the random distortion, flip, crop (or however else you're augmenting your samples) right at that moment and to that single data sample. What this will accomplish is that every time the trainer tries to load a sample, it will actually receive a modified version which will probably be different from any other image it has seen at a previous iteration.
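To make that concrete, here is a minimal sketch of augmenting at load time. It assumes each sample arrives as an (H, W, C) numpy array; the augment helper, patch_size value, and the commented loop structure are illustrative, not part of noccn's or cudaconvnet's actual API:

```python
import numpy as np

def augment(image, patch_size, rng=np.random):
    """Return one random patch of the image, flipped horizontally half the time."""
    h, w, _ = image.shape
    top = rng.randint(0, h - patch_size + 1)
    left = rng.randint(0, w - patch_size + 1)
    patch = image[top:top + patch_size, left:left + patch_size]
    if rng.rand() < 0.5:
        patch = patch[:, ::-1]  # horizontal flip along the width axis
    return patch

# Hypothetical inner loop -- noccn's real trainer will look different:
# for image, label in current_batch:
#     x = augment(image, patch_size=24)  # fresh random patch/flip on every visit
#     feed (x, label) to the net
```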
Then, of course, you will need to adjust something else to still get the 1000x data size factor in. Either:
- have your trainer loop over the original dataset 1000 times before it counts an epoch as finished, or
- simply run 1000x as many epochs/iterations as you originally planned.
(See the sketch further down for what the first option could look like.)
This way, you'll always have your target classes as randomly distributed throughout your dataset as your original data was, without consuming any extra diskspace to cache your augmented samples. This is, of course, at the cost of additional computing power, since you'd be generating the samples on demand at every step along the way, but you already know that...
Additionally, and perhaps more importantly, your batches will stay at the original size of 128, so the mini-batching process remains untouched and your parameter updates will keep arriving at the same frequency you'd expect otherwise. The same process also works for pure SGD training (batch size = 1), since the trainer will never see the "same" image twice.
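For what it's worth, here is a minimal sketch of the first option from the list above, reusing the hypothetical augment helper; the images/labels/train_step interface is made up for illustration and will differ from noccn's real trainer:

```python
import numpy as np

BATCH_SIZE = 128
N_AUGMENT = 1000   # the augmentation factor you want to recover

def train_one_epoch(images, labels, train_step, rng=np.random):
    """Treat N_AUGMENT shuffled passes over the original data as one epoch.
    Every pass re-shuffles and re-augments, so batches stay at 128, classes
    stay mixed, and the net (almost) never sees the exact same input twice."""
    n = len(images)
    for _ in range(N_AUGMENT):
        order = rng.permutation(n)                 # fresh shuffle each pass
        for start in range(0, n, BATCH_SIZE):
            idx = order[start:start + BATCH_SIZE]
            x = np.stack([augment(images[i], patch_size=24, rng=rng)
                          for i in idx])
            y = np.asarray([labels[i] for i in idx])
            train_step(x, y)                       # hypothetical parameter update
```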
Hope that helps.