I am normalizing my datasets, but the data contains a lot of zeros due to padding.
I can mask them during model training, but these zeros are affected when I apply normalization.
I am currently using the sklearn library to do the normalization:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
For example, take a 3D array with shape (4, 3, 5) as (batch, step, features).
The amount of zero padding varies from batch to batch, because these are features extracted with a fixed window size from audio files of varying lengths.
[[[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]],
 [[1, 2, 3, 4, 5],
  [4, 5, 6, 7, 8],
  [9, 10, 11, 12, 13]],
 [[14, 15, 16, 17, 18],
  [0, 0, 0, 0, 0],
  [24, 25, 26, 27, 28]],
 [[0, 0, 0, 0, 0],
  [423, 2, 230, 60, 70],
  [0, 0, 0, 0, 0]]]
I wish to perform normalization by column, so:
scaler = MinMaxScaler()
# Flatten (batch, step) into one axis so the scaler sees one row per time step,
# then restore the original 3D shape afterwards.
X_train = scaler.fit_transform(X_train.reshape(-1, X_train.shape[-1])).reshape(X_train.shape)
X_test = scaler.transform(X_test.reshape(-1, X_test.shape[-1])).reshape(X_test.shape)
However, in this case the zeros are treated as valid values. For example, the minimum of the first column should be 1, not 0.
Furthermore, the zeros themselves are changed by the scalers, but I wish to keep them as zeros so I can mask them during training:
model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))
Is there any way to mask the zeros during normalization, so that only the 2nd and 3rd steps in this example are used?
In addition, the actual array for my project is bigger, with shape (2000, 50, 68), and the value ranges of the 68 features can differ widely. I tried to normalize by dividing each element by the biggest element in its row to avoid the impact of the zeros, but this did not work out well.
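For reference, a minimal sketch of that row-wise attempt (my own names; X is assumed to be the (2000, 50, 68) float array):
import numpy as np

# Divide each time step by its own maximum; all-zero padding rows stay zero
# because the guard replaces a zero max with 1 before dividing.
row_max = X.max(axis=-1, keepdims=True)      # shape (2000, 50, 1)
X = X / np.where(row_max == 0, 1, row_max)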
You can normalize data into the [0, 1] range using the formula (data - np.min(data)) / (np.max(data) - np.min(data)): subtract the minimum from every value and divide by the range, i.e. max - min, which gives the normalized value of that number.
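For instance, applied to a single feature column with NumPy (a small illustration, not part of the question's code):
import numpy as np

data = np.array([1.0, 4.0, 9.0, 14.0, 24.0])   # one feature column, no padding
normalized = (data - np.min(data)) / (np.max(data) - np.min(data))
print(normalized)  # [0.         0.13043478 0.34782609 0.56521739 1.        ]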
Masking for just MinMaxScaler() can be solved by the code below.
Every other operation needs its own way of handling; if you mention all the operations that need masking, we can solve them on a one-by-one basis and I'll extend my answer. For example, Keras layers can be masked with a tf.keras.layers.Masking() layer, as you mentioned.
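A minimal sketch of that Keras masking (the LSTM and Dense layers are placeholders; the (50, 68) shape is assumed from your data):
import tensorflow as tf

# Masking skips time steps whose feature vector is entirely equal to mask_value,
# which is why the padded zeros must survive normalization unchanged.
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(50, 68)),
    tf.keras.layers.LSTM(32),    # placeholder recurrent layer
    tf.keras.layers.Dense(1),    # placeholder output layer
])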
The following code min/max-scales only the non-zero steps; the all-zero padding rows remain zeros.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[1, 2, 3, 4, 5],
[4, 5, 6, 7, 8],
[9, 10, 11, 12, 13]],
[[14, 15, 16, 17, 18],
[0, 0, 0, 0, 0],
[24, 25, 26, 27, 28]],
[[0, 0, 0, 0, 0],
[423, 2, 230, 60, 70],
[0, 0, 0, 0, 0]]
], dtype=np.float64)
nz = np.any(X, -1)  # True for steps with at least one non-zero feature
X[nz] = MinMaxScaler().fit_transform(X[nz])  # scale only those steps; padding rows stay zero
print(X)
Output:
[[[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]]
[[0. 0. 0. 0. 0. ]
[0.007109 0.13043478 0.01321586 0.05357143 0.04615385]
[0.01895735 0.34782609 0.03524229 0.14285714 0.12307692]]
[[0.03080569 0.56521739 0.05726872 0.23214286 0.2 ]
[0. 0. 0. 0. 0. ]
[0.05450237 1. 0.10132159 0.41071429 0.35384615]]
[[0. 0. 0. 0. 0. ]
[1. 0. 1. 1. 1. ]
[0. 0. 0. 0. 0. ]]]
If you need to fit MinMaxScaler() on one dataset and apply it later to others, you can do the following:
scaler = MinMaxScaler().fit(X[np.any(X, -1)])          # fit on the non-padded steps only
X[np.any(X, -1)] = scaler.transform(X[np.any(X, -1)])
Y[np.any(Y, -1)] = scaler.transform(Y[np.any(Y, -1)])  # Y is another array with the same feature layout
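If you apply this in several places, the same pattern can be wrapped in a small helper (a sketch, not part of the original code; X_train and X_test are assumed float arrays shaped like yours):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def transform_preserving_padding(scaler, X):
    # Apply a fitted scaler only to steps with at least one non-zero feature;
    # all-zero padding rows are left untouched.
    nz = np.any(X, axis=-1)
    X[nz] = scaler.transform(X[nz])
    return X

scaler = MinMaxScaler().fit(X_train[np.any(X_train, axis=-1)])
X_train = transform_preserving_padding(scaler, X_train)
X_test = transform_preserving_padding(scaler, X_test)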