Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does bias in the convolutional layer really make a difference to the test accuracy?

I understand that bias are required in small networks, to shift the activation function. But in the case of Deep network that has multiple layers of CNN, pooling, dropout and other non -linear activations, is Bias really making a difference? The convolutional filter is learning local features and for a given conv output channel same bias is used.

This is not a dupe of this link. The above link only explains role of bias in small neural network and does not attempt to explain role of bias in deep-networks containing multiple CNN layers, drop-outs, pooling and non-linear activation functions.

I ran a simple experiment and the results indicated that removing bias from conv layer made no difference in final test accuracy. There are two models trained and the test-accuracy is almost same (slightly better in one without bias.)

  • model_with_bias,
  • model_without_bias( bias not added in conv layer)

Are they being used only for historical reasons?

If using bias provides no gain in accuracy, shouldn't we omit them? Less parameters to learn.

I would be thankful if someone who have deeper knowledge than me, could explain the significance(if- any) of these bias in deep networks.

Here is the complete code and the experiment result bias-VS-no_bias experiment

batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, image_size, image_size, num_channels))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth], stddev=0.1))
  layer1_biases = tf.Variable(tf.zeros([depth]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth, depth], stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [image_size // 4 * image_size // 4 * depth, num_hidden], stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

  # define a Model with bias .
  def model_with_bias(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

  # define a Model without bias added in the convolutional layer.
  def model_without_bias(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv ) # layer1_ bias is not added 
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv) # + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    # bias are added only in Fully connected layer(layer 3 and layer 4)
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

  # Training computation.
  logits_with_bias = model_with_bias(tf_train_dataset)
  loss_with_bias = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_with_bias))

  logits_without_bias = model_without_bias(tf_train_dataset)
  loss_without_bias = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_without_bias))

  # Optimizer.
  optimizer_with_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_with_bias)
  optimizer_without_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_without_bias)

  # Predictions for the training, validation, and test data.
  train_prediction_with_bias = tf.nn.softmax(logits_with_bias)
  valid_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_valid_dataset))
  test_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_test_dataset))

  # Predictions for without
  train_prediction_without_bias = tf.nn.softmax(logits_without_bias)
  valid_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_valid_dataset))
  test_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_test_dataset))

num_steps = 1001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    session.run(optimizer_with_bias, feed_dict=feed_dict)
    session.run(optimizer_without_bias, feed_dict = feed_dict)
  print('Test accuracy(with bias): %.1f%%' % accuracy(test_prediction_with_bias.eval(), test_labels))
  print('Test accuracy(without bias): %.1f%%' % accuracy(test_prediction_without_bias.eval(), test_labels))

Output:

Initialized

Test accuracy(with bias): 90.5%

Test accuracy(without bias): 90.6%

like image 950
Aparajuli Avatar asked Aug 22 '18 03:08

Aparajuli


People also ask

Does convolution layer have bias?

Parameters of a Convolutional LayerThere is one bias for each output channel. Each bias is added to every element in that output channel. Note that the bias computation was not shown in the above figures, and are often omitted in other texts describing convolutional arithmetics. Nevertheless, the biases are there.

What is the use of bias in CNN?

Bias allows you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in Neural Networks can be thought of as analogous to the role of a constant in a linear function, whereby the line is effectively transposed by the constant value.

How is bias added in CNN?

Since in CNN, we are taking one filter to indicate one feature. We introduce a variable(b) to incorporate the bias from that particular filter. Hence, each filter takes into account the bias that it can cause. The bias is not due to individual filter weights but the whole filter itself.

Do deeper convolutional networks perform better?

Increasing depth beyond a critical value leads to a decrease in test accuracy. As depth increases, the performance of the Fully-Conv Net approaches that of a wide fully connected network (shown in red).


2 Answers

Biases are tuned alongside weights by learning algorithms such as gradient descent. biases differ from weights is that they are independent of the output from previous layers. Conceptually bias is caused by input from a neuron with a fixed activation of 1, and so is updated by subtracting the just the product of the delta value and learning rate.

In a large model, removing the bias inputs makes very little difference because each node can make a bias node out of the average activation of all of its inputs, which by the law of large numbers will be roughly normal. At the first layer, the ability for this to happens depends on your input distribution. On a small network, of course you need a bias input, but on a large network, removing it makes almost no difference.

Although in a large network it has no difference, it still depends on network architecture. For instance in LSTM:

Most applications of LSTMs simply initialize the LSTMs with small random weights which works well on many problems. But this initialization effectively sets the forget gate to 0.5. This introduces a vanishing gradient with a factor of 0.5 per timestep, which can cause problems whenever the long term dependencies are particularly severe. This problem is addressed by simply initializing the forget gates bias to a large value such as 1 or 2. By doing so, the forget gate will be initialized to a value that is close to 1, enabling gradient flow.

See also:

  • The rule of bias in Neural network
  • What is bias in Neural network
  • An Empirical Exploration of Recurrent Network Architectures
like image 130
Amir Avatar answered Oct 17 '22 10:10

Amir


In most networks you have a batchnorm layer after the conv layer, which has a bias. So if you have a batchnorm layer there is no gain. See: Can not use both bias and batch normalization in convolution layers

Otherwise, from a math perspective you are learning different functions. However, it turns out that in particular if you have a very complex network for a simple problem, you might achieve almost the same thing without biases than with biases but ending up using more parameters. In my experience, using a factor of 2-4 more parameters than needed rarely hurts performance in deep learning - in particular if you regularize. So, it is hard to notice any difference. However, you might try to use few channels (I don't think depth of the network matters as much as number of channels of the convolution) and see if bias make a difference. I would guess so.

like image 40
J. Schneider Avatar answered Oct 17 '22 10:10

J. Schneider