 

batch normalization, yes or no?

I use TensorFlow 1.14.0 and Keras 2.2.4. The following code implements a simple neural network:

import numpy as np
np.random.seed(1)
import random
random.seed(2)
import tensorflow as tf
tf.set_random_seed(3)

from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Activation


x_train=np.random.normal(0,1,(100,12))

model = Sequential()
model.add(Dense(8, input_shape=(12,)))
# model.add(tf.keras.layers.BatchNormalization())
model.add(Activation('linear'))
model.add(Dense(12))
model.add(Activation('linear'))
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train, x_train,epochs=20, validation_split=0.1, shuffle=False,verbose=2)

The final val_loss after 20 epochs is 0.7751. When I uncomment the only comment line to add the batch normalization layer, the val_loss changes to 1.1230.

My actual problem is far more complicated, but the same thing happens there. Since my activation is linear, it does not matter whether I put the batch normalization before or after the activation.

Questions: Why doesn't batch normalization help here? Is there anything I can change so that batch normalization improves the result without changing the activation functions?

Update after getting a comment:

A neural network with one hidden layer and linear activations behaves much like PCA; there are many papers on this. For me, this setting gives the lowest MSE among all combinations of activation functions for the hidden and output layers.

Some resources stating that linear activations amount to PCA:

https://arxiv.org/pdf/1702.07800.pdf

https://link.springer.com/article/10.1007/BF00275687

https://www.quora.com/How-can-I-make-a-neural-network-to-work-as-a-PCA
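For reference, the PCA analogy can be made concrete with a small NumPy sketch. It computes the reconstruction MSE of the best rank-k affine approximation of the data via SVD, which is the lowest training MSE a linear autoencoder with k hidden units could reach on that same data. The helper name pca_reconstruction_mse is hypothetical (not from any library), and the number is not directly comparable to val_loss, which is measured on held-out samples.

```python
import numpy as np


def pca_reconstruction_mse(x, k):
    """MSE of the best rank-k reconstruction of x (PCA computed via SVD)."""
    mean = x.mean(axis=0)
    xc = x - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    components = vt[:k]
    # Project onto the top-k directions, then map back to input space.
    x_hat = xc @ components.T @ components + mean
    return float(np.mean((x - x_hat) ** 2))


# Data generated the same way as x_train in the question.
rng = np.random.RandomState(1)
x_train = rng.normal(0, 1, (100, 12))

mse_8 = pca_reconstruction_mse(x_train, 8)    # 8 components, like Dense(8)
mse_12 = pca_reconstruction_mse(x_train, 12)  # full rank: near-zero error
print(mse_8, mse_12)
```

A training loss well above mse_8 after many epochs would suggest the optimizer has not converged yet, rather than a limit of the architecture.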

Asked by Albert, Oct 29 '19 17:10




1 Answer

Yes.

The behavior you're observing is a bug, and you don't need BN to see it. In the plot, the left panel is for #V1 and the right panel is for #V2:

[Figure: sum of layer weights during training, #V1 (left) vs. #V2 (right)]

#V1
model = Sequential()
model.add(Dense(8, input_shape=(12,)))
#model.add(Activation('linear')) <-- uncomment == #V2
model.add(Dense(12))
model.compile(optimizer='adam', loss='mean_squared_error')

The difference is clearly nonsensical, since Activation('linear') after a layer with activation=None (equivalent to 'linear') is an identity: model.layers[1].output.name == 'activation/activation/Identity:0'. This can be confirmed further by fetching and plotting the intermediate layer outputs, which are identical for 'dense' and 'activation'; I'll omit that here.

So the activation should do literally nothing, yet it changes the result. Somewhere along the commit chain between 1.14.0 and 2.0.0 this was fixed, though I don't know where. Results with BN using TF 2.0.0 and Keras 2.3.1 are below:

val_loss = 0.840 # without BN
val_loss = 0.819 # with BN



Solution: update to TensorFlow 2.0.0, Keras 2.3.1.

Tip: use Anaconda with a virtual environment. If you don't have any virtual environments yet, run:

conda create --name tf2_env --clone base
conda activate tf2_env
conda uninstall tensorflow-gpu
conda uninstall keras
conda install -c anaconda tensorflow-gpu==2.0.0
conda install -c conda-forge keras==2.3.1

It may be a bit more involved than this, but that's the subject of another question.


UPDATE: importing from keras instead of tf.keras also solves the problem.


Disclaimer: BN remains a 'controversial' layer in Keras, yet to be fully fixed - see Relevant Git; I plan on investigating it myself eventually, but for your purposes, this answer's fix should suffice.

I also recommend familiarizing yourself with BN's underlying theory, in particular its train vs. inference operation. In a nutshell, batch sizes under 32 are a pretty bad idea, and the dataset should be large enough for BN's moving averages to accurately approximate the test-set mean and variance.
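To illustrate the train vs. inference distinction, here is a minimal pure-Python sketch of batch normalization for a single feature. It is not Keras' implementation: the class name BatchNorm1D is hypothetical, gamma and beta are held fixed rather than learned, and the momentum and epsilon defaults are chosen to match Keras' documented defaults (0.99 and 1e-3).

```python
from statistics import mean, pvariance


class BatchNorm1D:
    """Batch normalization of one scalar feature (illustrative, not Keras code)."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.gamma = 1.0        # learned scale, held fixed in this sketch
        self.beta = 0.0         # learned shift, held fixed in this sketch
        self.moving_mean = 0.0  # running estimate used at inference
        self.moving_var = 1.0   # running estimate used at inference

    def __call__(self, batch, training):
        if training:
            # Training: normalize with the statistics of the current batch
            # and nudge the moving averages toward them.
            mu, var = mean(batch), pvariance(batch)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mu
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference: batch statistics are ignored; moving averages are used.
            mu, var = self.moving_mean, self.moving_var
        scale = self.gamma / (var + self.eps) ** 0.5
        return [scale * (v - mu) + self.beta for v in batch]


bn = BatchNorm1D()
batch = [4.0, 6.0, 8.0, 10.0]
out_train = bn(batch, training=True)   # centered on zero
out_infer = bn(batch, training=False)  # still far from zero after one update

# Weight still carried by the initial moving statistics after 60 updates.
leftover = bn.momentum ** 60
print(mean(out_train), mean(out_infer), round(leftover, 3))
```

Assuming the question's fit call uses the default batch size of 32 on 90 training samples, 20 epochs give about 60 moving-average updates, after which roughly 55% of the weight is still on the initial values. That may be part of why a small dataset and few updates leave the inference statistics poorly estimated.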


Code used:

x_train=np.random.normal(0, 1, (100, 12))

model = Sequential()
model.add(Dense(8, input_shape=(12,)))
#model.add(Activation('linear'))
#model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(12))
model.compile(optimizer='adam', loss='mean_squared_error')

W_sum_all = []  # fit rewritten to allow runtime weight collection
for _ in range(20):
    for i in range(9):
        x = x_train[i*10:(i+1)*10]
        model.train_on_batch(x, x)

        W_sum_all.append([])
        for layer in model.layers:
            if layer.trainable_weights != []:
                W_sum_all[-1] += [np.sum(layer.get_weights()[0])]
model.evaluate(x_train[-10:], x_train[-10:])  # evaluate on the held-out last 10 samples

plt.plot(W_sum_all)
plt.title("Sum of weights (#V1)", weight='bold', fontsize=14)
plt.legend(labels=["dense", "dense_1"], fontsize=14)
plt.gcf().set_size_inches(7, 4)

Imports/pre-executions:

import numpy as np
np.random.seed(1)
import random
random.seed(2)
import tensorflow as tf
if tf.__version__[0] == '2':
    tf.random.set_seed(3)
else:
    tf.set_random_seed(3)

import matplotlib.pyplot as plt
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Activation

Answered by OverLordGoldDragon, Oct 23 '22 05:10