My training data are saved in 3 files, each of which is too large to fit into memory. Each training example is two-dimensional (2805 rows and 222 columns; the 222nd column is the label) and consists of numerical values. I would like to normalize the data before feeding it into models for training. Below is my code for the input pipeline; the data have not been normalized before creating the dataset. Are there functions in TensorFlow that can do normalization for my case?
dataset = tf.data.TextLineDataset([file1, file2, file3])
# combine 2805 lines into a single example
dataset = dataset.batch(2805)
def parse_example(line_batch):
    record_defaults = [[1.0] for col in range(0, 221)]
    record_defaults.append([1])
    content = tf.decode_csv(line_batch, record_defaults = record_defaults, field_delim = '\t')
    features = tf.stack(content[0:221])
    features = tf.transpose(features)
    label = content[-1][-1]
    label = tf.one_hot(indices = tf.cast(label, tf.int32), depth = 2)
    return features, label
dataset = dataset.map(parse_example)
dataset = dataset.shuffle(1000)
# batch multiple examples
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
data_batch, label_batch = iterator.get_next()
... take(N) if N samples are enough for it to figure out the mean and variance. layer1 = norm(input) ... The advantage of using it in the model is that the normalization mean and variance are saved as part of the model weights, so when you load the saved model, it uses the same values it was trained with.
Yes, normalisation/scaling is typically recommended and sometimes very important. For neural networks in particular it can be crucial: when you feed unnormalised inputs to activation functions, you can get stuck in a very flat region of the domain and may not learn at all.
Normalization techniques in machine learning: the most widely used types include min-max scaling, which subtracts the column's minimum value from each value in that column and divides by the range (maximum minus minimum). Each new column then has a minimum value of 0 and a maximum value of 1.
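As a small illustration of min-max scaling, here is a plain-Python sketch. The function name `min_max_scale` is made up for this example, and mapping a constant column to zeros is an assumed convention, not something the description above prescribes.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] using the column's own min and max."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # Constant column: map every value to 0.0 to avoid dividing by zero
        # (an assumed convention for this sketch).
        return [0.0 for _ in column]
    return [(value - lo) / span for value in column]


scaled = min_max_scale([2.0, 4.0, 10.0])
print(scaled)  # [0.0, 0.25, 1.0]
```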
There are different ways of "normalizing data". Depending on which one you have in mind, it may or may not be easy to implement in your case.
If you know the fixed range(s) of your values (e.g. feature #1 has values in [-5, 5], feature #2 has values in [0, 100], etc.), you could easily pre-process your feature tensor in parse_example(), e.g.:
def normalize_fixed(x, current_range, normed_range):
    # Convert the Python lists to tensors so they can be sliced and broadcast.
    current_range = tf.constant(current_range, dtype=x.dtype)
    normed_range = tf.constant(normed_range, dtype=x.dtype)
    # x has shape (rows, features), so the per-feature bounds get shape (1, features).
    current_min, current_max = tf.expand_dims(current_range[:, 0], 0), tf.expand_dims(current_range[:, 1], 0)
    normed_min, normed_max = tf.expand_dims(normed_range[:, 0], 0), tf.expand_dims(normed_range[:, 1], 0)
    x_normed = (x - current_min) / (current_max - current_min)
    x_normed = x_normed * (normed_max - normed_min) + normed_min
    return x_normed

def parse_example(line_batch,
                  fixed_range=[[-5, 5], [0, 100], ...],
                  normed_range=[[0, 1]]):
    # ...
    features = tf.transpose(features)
    features = normalize_fixed(features, fixed_range, normed_range)
    # ...
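To check the arithmetic of the fixed-range mapping without running TensorFlow, here is a plain-Python sketch of the same formula. `rescale_fixed` and the two example ranges are illustrative assumptions. It applies the same two steps as above: map each value to its position in [0, 1] within the known range, then stretch to the target range.

```python
def rescale_fixed(rows, current_ranges, target_range=(0.0, 1.0)):
    """Map column j of every row from current_ranges[j] to target_range."""
    t_min, t_max = target_range
    out = []
    for row in rows:
        scaled = []
        for value, (c_min, c_max) in zip(row, current_ranges):
            unit = (value - c_min) / (c_max - c_min)  # position within [0, 1]
            scaled.append(unit * (t_max - t_min) + t_min)
        out.append(scaled)
    return out


# Two features with known ranges [-5, 5] and [0, 100].
rows = [[-5.0, 0.0], [0.0, 50.0], [5.0, 100.0]]
print(rescale_fixed(rows, [(-5.0, 5.0), (0.0, 100.0)]))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```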
If your features are supposed to have approximately the same range of values, per-sample normalization could also be considered, i.e. normalizing each sample using its own feature moments (mean, variance):
def normalize_with_moments(x, axes=[0, 1], epsilon=1e-8):
    mean, variance = tf.nn.moments(x, axes=axes)
    x_normed = (x - mean) / tf.sqrt(variance + epsilon) # epsilon to avoid dividing by zero
    return x_normed

def parse_example(line_batch):
    # ...
    features = tf.transpose(features)
    features = normalize_with_moments(features)
    # ...
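For intuition about what per-sample normalization produces, here is a plain-Python sketch that mirrors the moments-based formula on a tiny 2x2 sample. `standardize_sample` is a name invented for this illustration. It uses the population variance, which is also what tf.nn.moments returns.

```python
import math


def standardize_sample(sample, epsilon=1e-8):
    """Standardize one 2-D sample (a list of rows) using moments over all its values."""
    values = [v for row in sample for v in row]
    mean = sum(values) / len(values)
    # Population variance (divide by N), matching tf.nn.moments.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = math.sqrt(variance + epsilon)  # epsilon avoids dividing by zero
    return [[(v - mean) / scale for v in row] for row in sample]


normed = standardize_sample([[1.0, 2.0], [3.0, 4.0]])
flat = [v for row in normed for v in row]
# After standardization the sample has mean close to 0 and variance close to 1.
print(sum(flat) / len(flat), sum(v * v for v in flat) / len(flat))
```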
You could apply the same procedure over a complete batch instead of per-sample, which may make the process more stable:
data_batch = normalize_with_moments(data_batch, axes=[0, 1, 2])  # moments over the whole batch
Similarly, you could use tf.nn.batch_normalization.
Normalizing using the mean/variance computed over the whole dataset would be the trickiest, since, as you mentioned, it is a large, split one. tf.data.Dataset isn't really meant for such global computation. A solution would be to use whatever tools you have to pre-compute the dataset moments, then use this information for your TF pre-processing.
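One way to pre-compute the dataset moments without loading a whole file into memory is a single streaming pass with Welford's online update. The sketch below is one possible approach, not the only one. `streaming_column_moments` is a hypothetical helper, and it assumes tab-separated lines with the feature columns first and the label last, as in the question.

```python
import math


def streaming_column_moments(line_sources, n_features, delimiter="\t"):
    """Return per-column (means, stds) over all lines, one line at a time.

    line_sources: iterable of iterables of text lines (open file objects work).
    Only the first n_features columns are used; later columns (e.g. the label)
    are ignored.
    """
    count = 0
    means = [0.0] * n_features
    m2 = [0.0] * n_features  # running sum of squared deviations per column
    for source in line_sources:
        for line in source:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            fields = line.split(delimiter)
            count += 1
            for j in range(n_features):
                x = float(fields[j])
                delta = x - means[j]
                means[j] += delta / count
                m2[j] += delta * (x - means[j])
    if count == 0:
        raise ValueError("no data lines found")
    stds = [math.sqrt(s / count) for s in m2]  # population standard deviation
    return means, stds


# Tiny stand-ins for two files: two feature columns plus a label column.
file_a = ["1.0\t10.0\t0", "3.0\t30.0\t1"]
file_b = ["5.0\t50.0\t0"]
means, stds = streaming_column_moments([file_a, file_b], n_features=2)
print(means, stds)
```

With the real data you would pass the three opened files and n_features=221. The resulting means and standard deviations can then be stored as constants and applied in parse_example() as (features - mean) / std.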
As mentioned by @MiniQuark, TensorFlow has a Transform library you could use to preprocess your data. Have a look at the Get Started guide, or for instance at the tft.scale_to_z_score() method for sample normalization.