For example, when stacking Dense layers, we conventionally set the number of neurons to 256, 128, 64, and so on.
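To make the pattern concrete, here is a minimal NumPy sketch of such a "halving" stack of dense layers (the input width of 512 and the ReLU activation are arbitrary assumptions for illustration, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out):
    # One fully connected layer with ReLU: y = max(0, x @ W + b)
    w = rng.standard_normal((x.shape[-1], n_out)) * 0.01
    b = np.zeros(n_out)
    return np.maximum(0.0, x @ w + b)

x = rng.standard_normal((32, 512))   # batch of 32 samples, 512 features
for units in (256, 128, 64):         # the 2^n widths the question refers to
    x = dense(x, units)

print(x.shape)  # (32, 64)
```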
My question is:
What is the reason for conventionally using 2^n neurons? Does this make the code run faster? Does it save memory? Or are there other reasons?
It's historical. Early neural network implementations for GPU computing (written in CUDA, OpenCL, etc.) had to concern themselves with efficient memory management to achieve data parallelism.
Generally speaking, you have to align N computations onto the physical processors. The number of physical processors is usually a power of 2. Therefore, if the number of computations is not a power of 2, they can't be mapped 1:1 onto the processors and have to be moved around, requiring additional memory management (further reading here). This was only relevant for parallel batch processing, i.e. a batch size that is a power of 2 gave you slightly better performance. Interestingly, setting other hyperparameters, such as the number of hidden units, to a power of 2 never had a measurable benefit - I assume that as neural networks became more popular, people simply adopted this practice without knowing why and spread it to other hyperparameters.
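The alignment argument can be illustrated with a few lines of arithmetic. Assume work items are dispatched in fixed-size groups (32 lanes here, an assumed group size for illustration, not a property of any particular framework): a batch that isn't a multiple of the group size leaves lanes idle in the last group.

```python
import math

LANES = 32  # assumed scheduling-group size (e.g. a 32-lane GPU warp)

def utilization(n_items):
    # Groups actually launched to cover n_items work items
    groups = math.ceil(n_items / LANES)
    # Fraction of launched lanes that do useful work
    return n_items / (groups * LANES)

print(utilization(128))  # 1.0     -> maps 1:1 onto 4 full groups
print(utilization(129))  # 0.80625 -> 5 groups; 31 lanes idle in the last
```

With 128 items every lane is busy; with 129 a whole extra group is launched for a single item, which is the kind of mismatch the answer describes.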
Nowadays, some low-level implementations might still benefit from this convention, but if you're using CUDA with TensorFlow or PyTorch in 2020 on a modern GPU architecture, you're very unlikely to see any difference between a batch size of 128 and 129, as these systems are highly optimized for efficient data parallelism.