Neural networks for image recognition can be really big. There can be thousands of input/hidden neurons and millions of connections, which can take up a lot of computing resources.
In C++ a float is commonly 32-bit and a double 64-bit; there is not much difference in speed between them, yet using floats can save a noticeable amount of memory.
Given a neural network that uses sigmoid as its activation function, if we could choose which variables in the network are float and which are double, which ones could be float to save memory without making the network unable to perform?
Inputs and outputs for training/test data can definitely be floats, because they do not require double precision: colors in an image can only be in the range 0-255, and when normalized to a 0.0-1.0 scale the unit step is 1 / 255 ≈ 0.0039.
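As a rough illustration of that point (a minimal sketch, not from the original post), the snippet below normalizes a few 8-bit pixel values with both float and double; the spacing between adjacent inputs is about 0.0039, far coarser than float's roughly 7 significant decimal digits:

```cpp
#include <cstdio>

// Minimal illustration: normalizing 8-bit pixel values to [0.0, 1.0].
// The step between adjacent pixel values is 1 / 255 ~ 0.0039, far coarser
// than float's ~7 significant decimal digits, so storing inputs as float
// loses nothing useful compared to double.
int main() {
    for (int pixel : {0, 1, 128, 255}) {
        float  as_float  = pixel / 255.0f;
        double as_double = pixel / 255.0;
        std::printf("pixel %3d -> float %.9f  double %.17f\n",
                    pixel, as_float, as_double);
    }
}
```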
1. What about the precision of hidden neurons' outputs; would it be safe to make them float too?
A hidden neuron's output is computed by summing each previous-layer neuron's output multiplied by its connection weight to the neuron currently being calculated, and then passing that sum through the activation function (currently sigmoid) to get the new output. The sum variable itself could be double, since it could become a really large number when the network is big (see the sketch after this list).
2. What about connection weights; could they be floats?
While inputs and neuron outputs are in the range 0-1.0 because of the sigmoid, weights are allowed to be larger than that.
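Here is a minimal sketch of the per-neuron forward pass described in point 1 (function and variable names are illustrative, not from the original post): outputs and weights are stored as float to save memory, while the running sum is accumulated in double before being squashed by the sigmoid.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Forward pass for one neuron: float storage, double accumulator.
// Assumes prev_outputs and weights have the same length.
float neuron_output(const std::vector<float>& prev_outputs,
                    const std::vector<float>& weights,
                    float bias) {
    double sum = bias;  // wide accumulator for the weighted sum
    for (std::size_t i = 0; i < prev_outputs.size(); ++i)
        sum += static_cast<double>(prev_outputs[i]) * weights[i];
    // Sigmoid squashes the sum back into (0, 1), so float is enough to hold it.
    return static_cast<float>(1.0 / (1.0 + std::exp(-sum)));
}
```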
Stochastic gradient descent backpropagation suffers from the vanishing gradient problem because of the activation function's derivative. I decided not to ask what precision the gradient variable should be, feeling that float will simply not be precise enough, especially when the network is deep.
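To make that worry concrete (a toy calculation, not from the original post, assuming the best-case sigmoid derivative of 0.25 at every layer): the chained gradient factor shrinks like 0.25 to the power of the depth, and in this toy chain float underflows to zero somewhere before depth 80 while double still holds the value.

```cpp
#include <cstdio>

// Toy illustration: sigmoid'(x) = s(x) * (1 - s(x)) never exceeds 0.25,
// so a chain of n such factors shrinks like 0.25^n. The printout shows where
// the product drops below what float can represent (smallest positive normal
// ~1.2e-38, smallest denormal ~1.4e-45) while double still carries it.
int main() {
    double product_d = 1.0;
    float  product_f = 1.0f;
    for (int layer = 1; layer <= 80; ++layer) {
        product_d *= 0.25;
        product_f *= 0.25f;
        if (layer % 20 == 0)
            std::printf("depth %2d: double %.3e  float %.3e\n",
                        layer, product_d, product_f);
    }
}
```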
- What about the precision of hidden neurons' outputs; would it be safe to make them float too?
Using float32 everywhere is usually the safe first choice for most neural network applications. GPUs currently support only float32, so many practitioners stick to float32 everywhere. For many applications, even 16-bit floating point values would be sufficient. Some extreme examples show that high-accuracy networks can be trained with as little as 2 bits per weight (https://arxiv.org/abs/1610.00324).
The complexity of deep networks is usually limited not by computation time, but by the amount of RAM on a single GPU and the throughput of the memory bus. Even if you're working on a CPU, using a smaller data type still helps to use the cache more efficiently. You're rarely limited by the machine datatype precision.
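A back-of-the-envelope illustration of that memory argument (the layer sizes here are hypothetical, not from the answer): a fully connected network with layers 1024 -> 4096 -> 4096 -> 10 already holds about 21 million weights, and halving the element size halves the footprint.

```cpp
#include <cstdio>

// Rough memory-footprint sketch for a hypothetical fully connected network.
int main() {
    const long long layers[] = {1024, 4096, 4096, 10};
    const int num_layers = 4;
    long long weights = 0;
    for (int i = 0; i + 1 < num_layers; ++i)
        weights += layers[i] * layers[i + 1];   // weights between adjacent layers
    std::printf("weights: %lld\n", weights);
    std::printf("double (8 B): %.1f MiB\n", weights * 8.0 / (1 << 20));
    std::printf("float  (4 B): %.1f MiB\n", weights * 4.0 / (1 << 20));
    std::printf("16-bit (2 B): %.1f MiB\n", weights * 2.0 / (1 << 20));
}
```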
since colors in an image can only be in the range 0-255,
You're doing it wrong. You force the network to learn the scale of your input data, when it is already known (unless you're using a custom weight initialization procedure). Better results are usually achieved when the input data is normalized to the range (-1, 1) or (0, 1) and the weights are initialized so that the average output of the layer is at the same scale. This is a popular initialization technique: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
If inputs are in the range [0, 255], then with an average input being ~ 100, and weights being ~ 1, the activation potential (the argument of the activation function) is going to be ~ 100×N, where N is the number of layer inputs, likely far away in the "flat" part of the sigmoid. So either you initialize your weights to be ~ 1/(100×N), or you scale your data and use any popular initialization method. Otherwise the network will have to spend a lot of training time just to bring the weights to this scale.
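The linked technique is commonly known as Xavier (Glorot) initialization. Below is a minimal sketch of one popular variant (the uniform form with limit sqrt(6 / (fan_in + fan_out))), together with the [0, 1] input normalization discussed above; the function names and the exact variant are illustrative, not taken from the answer.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Xavier/Glorot uniform initialization: draw each weight from
// [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), so the layer's
// outputs start out on roughly the same scale as its inputs.
std::vector<float> xavier_uniform(std::size_t fan_in, std::size_t fan_out,
                                  std::mt19937& rng) {
    const float limit = std::sqrt(6.0f / static_cast<float>(fan_in + fan_out));
    std::uniform_real_distribution<float> dist(-limit, limit);
    std::vector<float> w(fan_in * fan_out);
    for (float& x : w) x = dist(rng);
    return w;
}

// Normalize raw 0-255 pixel values to the [0, 1] range assumed above.
std::vector<float> normalize(const std::vector<unsigned char>& pixels) {
    std::vector<float> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i)
        out[i] = pixels[i] / 255.0f;
    return out;
}
```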
Stochastic gradient descent backpropagation suffers from the vanishing gradient problem because of the activation function's derivative. I decided not to ask what precision the gradient variable should be, feeling that float will simply not be precise enough, especially when the network is deep.
It's much less a matter of machine arithmetic precision than of the scale of the outputs of each layer. In practice:
This video should be helpful if you're not familiar with these concepts.
On the least number of bits needed for a single neuron:
The following papers have studied this question (descending chronological order):
Example from Deep Learning with Limited Numerical Precision: