mask 0 values during normalization

I am normalizing my datasets, but the data contains many zeros due to padding.

I can mask them during model training, but these zeros are also affected when I apply normalization.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

I am currently using the scikit-learn library for the normalization.

For example, given a 3D array with shape (4, 3, 5) as (batch, step, features):

The amount of zero-padding varies from batch to batch, since these are features extracted with a fixed window size from audio files of varying lengths.

[[[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]],

 [[1, 2, 3, 4, 5],
  [4, 5, 6, 7, 8],
  [9, 10, 11, 12, 13]],

 [[14, 15, 16, 17, 18],
  [0, 0, 0, 0, 0],
  [24, 25, 26, 27, 28]],

 [[0, 0, 0, 0, 0],
  [423, 2, 230, 60, 70],
  [0, 0, 0, 0, 0]]]

I wish to perform normalization by column, so:

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train.reshape(-1,X_train.shape[-1])).reshape(X_train.shape)
X_test = scaler.transform(X_test.reshape(-1,X_test.shape[-1])).reshape(X_test.shape)

However, in this case, zeros are treated as real values. For example, the minimum of the first feature column should be 1, not 0.

Further, the zero values are also changed after applying the scalers, but I wish to keep them as zeros so I can mask them during training:

model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))

Is there any way to mask the zeros during normalization so that only the non-zero steps (the 2nd and 3rd steps in this example) are used?

In addition, the actual array for my project is larger, with shape (2000, 50, 68), and the value ranges of the 68 features can differ greatly. I tried to normalize by dividing each element by the largest element in its row, to avoid the impact of the zeros, but this did not work out well.


Leo asked Oct 28 '20



1 Answer

Masking for MinMaxScaler() alone can be handled with the code below.

Every other operation needs its own handling; if you list all the operations that need masking, we can solve them one by one and I'll extend my answer. For example, Keras layers can be masked with the tf.keras.layers.Masking() layer, as you mentioned.

The following code min/max-scales only the non-zero timesteps; the rest remain zeros.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([
     [[0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0]],

     [[1,  2,  3,  4,  5],
      [4,  5,  6,  7,  8],
      [9, 10, 11, 12, 13]],

     [[14, 15, 16, 17, 18],
      [0, 0, 0, 0, 0],
      [24, 25, 26, 27, 28]],

     [[0, 0, 0, 0, 0],
      [423, 2, 230, 60, 70],
      [0, 0, 0, 0, 0]]
], dtype=np.float64)

nz = np.any(X, -1)
X[nz] = MinMaxScaler().fit_transform(X[nz])

print(X)

Output:

[[[0.         0.         0.         0.         0.        ]
  [0.         0.         0.         0.         0.        ]
  [0.         0.         0.         0.         0.        ]]

 [[0.         0.         0.         0.         0.        ]
  [0.007109   0.13043478 0.01321586 0.05357143 0.04615385]
  [0.01895735 0.34782609 0.03524229 0.14285714 0.12307692]]

 [[0.03080569 0.56521739 0.05726872 0.23214286 0.2       ]
  [0.         0.         0.         0.         0.        ]
  [0.05450237 1.         0.10132159 0.41071429 0.35384615]]

 [[0.         0.         0.         0.         0.        ]
  [1.         0.         1.         1.         1.        ]
  [0.         0.         0.         0.         0.        ]]]
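To see why this works, here is a small standalone demo (the tiny array is illustrative): np.any(X, -1) reduces over the feature axis, giving a boolean (batch, step) mask that is True wherever a timestep has at least one non-zero feature, and boolean fancy indexing with that mask flattens the selected timesteps into the 2D (n_selected, features) array that MinMaxScaler expects.

```python
import numpy as np

X = np.array([
    [[0, 0], [1, 2]],
    [[3, 4], [0, 0]],
], dtype=np.float64)

# Boolean mask over (batch, step): True where the timestep
# has any non-zero feature.
nz = np.any(X, axis=-1)
print(nz)     # [[False  True]
              #  [ True False]]

# Boolean indexing flattens the selected timesteps into a
# 2D (n_selected, features) array.
print(X[nz])  # [[1. 2.]
              #  [3. 4.]]
```

Assigning back with `X[nz] = ...` writes the scaled rows into their original positions, leaving the all-zero timesteps untouched.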

If you need to fit MinMaxScaler() on one dataset and apply it later to others, you can do the following:

scaler = MinMaxScaler().fit(X[np.any(X, -1)])
X[np.any(X, -1)] = scaler.transform(X[np.any(X, -1)])
Y[np.any(Y, -1)] = scaler.transform(Y[np.any(Y, -1)])
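The same train/test pattern can be written with each padding mask stored once in a variable, which avoids repeating the np.any() call. This is only a sketch: X_train and X_test here are random placeholder arrays with one timestep zeroed out to simulate padding.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: random (batch, step, features) arrays with
# one timestep zeroed out to simulate padding.
rng = np.random.default_rng(0)
X_train = rng.random((4, 3, 5))
X_train[0, 1] = 0.0
X_test = rng.random((2, 3, 5))
X_test[1, 2] = 0.0

# Build each padding mask once and reuse it.
train_mask = np.any(X_train, axis=-1)
test_mask = np.any(X_test, axis=-1)

# Fit only on the non-padded training timesteps, then transform both sets.
scaler = MinMaxScaler().fit(X_train[train_mask])
X_train[train_mask] = scaler.transform(X_train[train_mask])
X_test[test_mask] = scaler.transform(X_test[test_mask])

# Padded timesteps are untouched, so mask_value=0.0 still works downstream.
print(X_train[0, 1])  # [0. 0. 0. 0. 0.]
```

One caveat: MinMaxScaler() maps each feature's training minimum to 0, so individual real values can also become 0 after scaling. Keras' Masking layer only masks a timestep when all of its features equal mask_value, so a real timestep being accidentally masked is unlikely, though not impossible.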
Arty answered Oct 20 '22