The keras `BatchNormalization` layer uses `axis=-1` as its default value and states that the feature axis is typically normalized. Why is this the case?

I suppose this is surprising because I'm more familiar with using something like `StandardScaler`, which would be equivalent to using `axis=0`. This would normalize the features individually.

Is there a reason why samples are normalized individually by default (i.e. `axis=-1`) in keras, as opposed to features?
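For context, here is the kind of behaviour I have in mind with `StandardScaler` (a small sketch of my own, assuming scikit-learn and NumPy):

```
import numpy as np
from sklearn.preprocessing import StandardScaler

features = np.array([[1., 200.],
                     [2., 400.],
                     [3., 600.]])

# StandardScaler standardizes each column separately, i.e. it uses the
# mean and standard deviation computed over axis=0 (across the samples).
scaled = StandardScaler().fit_transform(features)
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))  # True
```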
Edit: example for concreteness
It's common to transform data such that each feature has zero mean and unit variance. Let's just consider the "zero mean" part with this mock dataset, where each row is a sample:
```
>>> data = np.array([[1, 10, 100, 1000], [2, 20, 200, 2000], [3, 30, 300, 3000]])
>>> data.mean(axis=0)
array([ 2., 20., 200., 2000.])
>>> data.mean(axis=1)
array([ 277.75, 555.5 , 833.25])
```
Wouldn't it make more sense to subtract the `axis=0` mean, as opposed to the `axis=1` mean? Using `axis=1`, the units and scales can be completely different.
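Continuing from the snippet above (just illustrating the broadcasting, nothing Keras-specific), subtracting the `axis=0` mean zero-centres every column, while the `axis=1` mean has to be broadcast back with `keepdims=True` and mixes values with completely different scales:

```
# Per-feature centring: every column now has zero mean.
centred_features = data - data.mean(axis=0)
print(centred_features.mean(axis=0))   # [0. 0. 0. 0.]

# Per-sample centring: each row's mean mixes the 1s, 10s, 100s and 1000s columns.
centred_samples = data - data.mean(axis=1, keepdims=True)
print(centred_samples[0])              # [-276.75 -267.75 -177.75  722.25]
```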
Edit 2:
The first equation of section 3 in this paper seems to imply that `axis=0` should be used for calculating expectations and variances for each feature individually, assuming you have an (m, n) shaped dataset where m is the number of samples and n is the number of features.
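To spell out what I mean (paraphrasing the equation from memory rather than quoting it): each feature $k$ is normalized with its own expectation and variance taken over the data,

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\!\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\!\left[x^{(k)}\right]}}$$

so for an (m, n) dataset, $\mathrm{E}[x^{(k)}]$ and $\mathrm{Var}[x^{(k)}]$ would be exactly the `axis=0` mean and variance of column $k$.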
Edit 3: another example
I wanted to see the dimensions of the means and variances `BatchNormalization` was calculating on a toy dataset:
```
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from keras.optimizers import Adam
from keras.models import Model
from keras.layers import BatchNormalization, Dense, Input

iris = load_iris()
X = iris.data
y = pd.get_dummies(iris.target).values

input_ = Input(shape=(4, ))
norm = BatchNormalization()(input_)
l1 = Dense(4, activation='relu')(norm)
output = Dense(3, activation='sigmoid')(l1)

model = Model(input_, output)
model.compile(Adam(0.01), 'categorical_crossentropy')
model.fit(X, y, epochs=100, batch_size=32)

bn = model.layers[1]
bn.moving_mean
# <tf.Variable 'batch_normalization_1/moving_mean:0' shape=(4,) dtype=float32_ref>
```
The input `X` has shape `(150, 4)`, and the `BatchNormalization` layer calculated 4 means, which means it operated over `axis=0`.

If `BatchNormalization` has a default of `axis=-1`, then shouldn't there be 150 means?
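To be explicit about the shapes being compared (this just appends a couple of prints to the script above):

```
print(X.shape)                   # (150, 4)
print(bn.moving_mean.shape)      # (4,)  -- one mean per feature, not one per sample
print(bn.moving_variance.shape)  # (4,)  -- likewise for the variance
```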
The confusion is due to the meaning of `axis` in `np.mean` versus in `BatchNormalization`.

When we take the mean along an axis, we collapse that dimension and preserve all other dimensions. In your example, `data.mean(axis=0)` collapses the 0-axis, which is the vertical dimension of `data`.
When we compute a `BatchNormalization` along an axis, we preserve the dimensions of the array, and we normalize with respect to the mean and standard deviation over every other axis. So in your 2D example, `BatchNormalization` with `axis=1` is subtracting the mean for `axis=0`, just as you expect. This is why `bn.moving_mean` has shape `(4,)`.
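You can mimic this convention with plain NumPy (a sketch of the convention only, not the actual Keras implementation): normalizing "along" the feature axis in the `BatchNormalization` sense means reducing over every other axis, which for a 2D `(samples, features)` array is axis 0:

```
import numpy as np

data = np.array([[1., 10., 100., 1000.],
                 [2., 20., 200., 2000.],
                 [3., 30., 300., 3000.]])

feature_axis = -1 % data.ndim   # Keras' default axis=-1, i.e. axis=1 for 2D input
reduce_axes = tuple(i for i in range(data.ndim) if i != feature_axis)  # (0,)

mean = data.mean(axis=reduce_axes)   # shape (4,), analogous to bn.moving_mean
var = data.var(axis=reduce_axes)
normalized = (data - mean) / np.sqrt(var + 1e-3)   # 1e-3 is BatchNormalization's default epsilon

print(mean.shape)  # (4,)
```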