keras BatchNormalization axis clarification

The Keras BatchNormalization layer defaults to axis=-1, and its documentation states that the feature axis is typically normalized. Why is this the case?

I suppose this is surprising because I'm more familiar with using something like StandardScaler, which would be equivalent to using axis=0. This would normalize the features individually.

Is there a reason why samples are individually normalized by default (i.e. axis=-1) in Keras, as opposed to features?

Edit: example for concreteness

It's common to transform data such that each feature has zero mean and unit variance. Let's just consider the "zero mean" part with this mock dataset, where each row is a sample:

    >>> import numpy as np
    >>> data = np.array([[   1,   10,  100, 1000],
                         [   2,   20,  200, 2000],
                         [   3,   30,  300, 3000]])

    >>> data.mean(axis=0)
    array([    2.,    20.,   200.,  2000.])

    >>> data.mean(axis=1)
    array([ 277.75,  555.5 ,  833.25])

Wouldn't it make more sense to subtract the axis=0 mean, as opposed to the axis=1 mean? Using axis=1, the units and scales can be completely different.

Edit 2:

The first equation of section 3 in this paper seems to imply that axis=0 should be used for calculating expectations and variances for each feature individually, assuming you have an (m, n) shaped dataset where m is the number of samples and n is the number of features.
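
Assuming "this paper" is the batch normalization paper by Ioffe and Szegedy (the link did not survive here), the equation being referred to normalizes each scalar feature k independently, with the expectation and variance taken over the training samples. For an (m, n) dataset that corresponds to reducing over the m samples, once for each of the n features:

```latex
\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\left[x^{(k)}\right]}}, \qquad k = 1, \dots, n
```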

Edit 3: another example

I wanted to see the dimensions of the means and variances BatchNormalization was calculating on a toy dataset:

    import pandas as pd
    import numpy as np
    from sklearn.datasets import load_iris

    from keras.optimizers import Adam
    from keras.models import Model
    from keras.layers import BatchNormalization, Dense, Input

    iris = load_iris()
    X = iris.data
    y = pd.get_dummies(iris.target).values

    input_ = Input(shape=(4, ))
    norm = BatchNormalization()(input_)
    l1 = Dense(4, activation='relu')(norm)
    output = Dense(3, activation='sigmoid')(l1)

    model = Model(input_, output)
    model.compile(Adam(0.01), 'categorical_crossentropy')
    model.fit(X, y, epochs=100, batch_size=32)

    bn = model.layers[1]
    bn.moving_mean  # <tf.Variable 'batch_normalization_1/moving_mean:0' shape=(4,) dtype=float32_ref>

The input X has shape (150, 4), and the BatchNormalization layer calculated 4 means, which means it operated over axis=0.

If BatchNormalization has a default of axis=-1 then shouldn't there be 150 means?

asked Nov 28 '17 18:11

trianta2

2 Answers

The confusion is due to the meaning of axis in np.mean versus in BatchNormalization.

When we take the mean along an axis, we collapse that dimension and preserve all other dimensions. In your example data.mean(axis=0) collapses the 0-axis, which is the vertical dimension of data.

When we compute a BatchNormalization along an axis, we preserve the dimensions of the array, and we normalize with respect to the mean and standard deviation over every other axis. So in your 2D example BatchNormalization with axis=1 is subtracting the mean for axis=0, just as you expect. This is why bn.moving_mean has shape (4,).
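
To make the distinction concrete, here is a minimal NumPy sketch (not the Keras implementation). The helper `bn_style_mean` is hypothetical: it treats `axis` as the axis to keep and averages over every other axis, which is the convention described above.

```python
import numpy as np

def bn_style_mean(x, axis=-1):
    """Mean over every axis except `axis`.

    Hypothetical helper that mimics the axis convention described in this
    answer; it is not part of Keras.
    """
    keep = axis % x.ndim  # turn a negative axis into its positive index
    reduce_axes = tuple(i for i in range(x.ndim) if i != keep)
    return x.mean(axis=reduce_axes)

data = np.array([[1, 10, 100, 1000],
                 [2, 20, 200, 2000],
                 [3, 30, 300, 3000]])

# np.mean collapses the axis it is given, so axis=0 leaves one mean per column.
numpy_means = data.mean(axis=0)

# The BatchNormalization-style axis names the axis that survives, so axis=-1
# (the feature axis) also leaves one mean per column.
bn_means = bn_style_mean(data, axis=-1)

print(numpy_means.shape, bn_means.shape)   # (4,) (4,)
print(np.allclose(numpy_means, bn_means))  # True
```

Under this reading, the default axis=-1 gives per-feature statistics, the same thing StandardScaler computes.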

answered Sep 22 '22 13:09

Imran

I know this post is old, but I am answering anyway because the confusion still lingers in the Keras documentation. I had to go through the code to figure this out:

1. The axis argument, which is documented as an integer, can actually be a list of integers denoting multiple axes. For example, if my input were an image batch in NHWC or NCHW format, I would pass axis=[1,2,3] to perform BatchNormalization the way the OP wants (i.e. computing statistics across the batch dimension only).
2. The axis list (or integer) should contain the axes that you do not want to reduce while calculating the mean and variance. In other words, it is the complement of the axes along which you want to normalize, which is quite the opposite of what the documentation appears to say if you go by the conventional definition of 'axes'. For example, if your input I has shape (N,H,W,C) or (N,C,H,W), i.e. the first dimension is the batch dimension, and you only want the mean and variance calculated across the batch dimension, you should supply axis=[1,2,3]. Keras will then calculate mean M and variance V tensors of shape (1,H,W,C) or (1,C,H,W) respectively, i.e. the batch dimension is reduced by the aggregation. In later operations like (I-M) and (I-M)/sqrt(V + epsilon), the first dimension of M and V is broadcast across all N samples of the batch.
3. The BatchNorm layer ends up calling tf.nn.moments with axes=(0,) in this example! That is because tf.nn.moments uses the conventional definition of axes.
4. Similarly, tf.nn.moments calls tf.reduce_mean, where again axes has the conventional meaning (i.e. the opposite of tf.keras.layers.BatchNormalization).
5. That said, the BatchNormalization paper suggests normalizing across the HxW spatial map in addition to the batch dimension (N). If one follows that advice, axis should include only the channel dimension (C), because that is the only remaining dimension you do not want to reduce. The Keras documentation is probably alluding to this, although it is quite cryptic.
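
The points above can be sketched in plain NumPy. This is an illustration of the axis convention only, not the Keras or TensorFlow code path; `bn_style_moments`, the array sizes, and the `eps` value are assumptions made for the example.

```python
import numpy as np

def bn_style_moments(x, axis):
    """Mean and variance over every axis NOT listed in `axis`.

    Hypothetical helper illustrating the convention described above. `axis`
    may be an int or a list of ints; reduced axes are kept with size 1 so
    the results broadcast against `x`.
    """
    axes = [axis] if isinstance(axis, int) else list(axis)
    keep = {a % x.ndim for a in axes}
    reduce_axes = tuple(i for i in range(x.ndim) if i not in keep)
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    return mean, var, reduce_axes

rng = np.random.default_rng(0)
N, H, W, C = 8, 5, 5, 3
images = rng.normal(loc=2.0, scale=3.0, size=(N, H, W, C))  # NHWC batch

# axis=[1, 2, 3]: only the batch axis is reduced.
m_batch, v_batch, reduced_batch = bn_style_moments(images, axis=[1, 2, 3])
print(reduced_batch, m_batch.shape)  # (0,) (1, 5, 5, 3)

# axis=-1: batch, height and width are all reduced, one value per channel.
m_chan, v_chan, reduced_chan = bn_style_moments(images, axis=-1)
print(reduced_chan, m_chan.shape)    # (0, 1, 2) (1, 1, 1, 3)

# Normalization broadcasts the statistics back over the reduced axes.
eps = 1e-3  # assumed small constant for numerical stability
normalized = (images - m_chan) / np.sqrt(v_chan + eps)
```

The `reduced_batch` and `reduced_chan` tuples are the conventional reduction axes that a function like tf.nn.moments would receive in each case.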

answered Sep 23 '22 13:09

BoltzmannMachine

Tags: python, machine-learning, deep-learning, keras
