
Normalizing data with binary and continuous variables for machine learning

Given the following data: [image not available]

I'm trying to figure out the right normalization step for pre-processing. Some of the features are categorical and one-hot encoded (categories a-c), some represent the time since an event, and some represent a release version.

I was thinking of using sklearn's MinMaxScaler to normalize the data to the range 0 to 1, but I'm not sure it is the right approach.

How do you decide the appropriate normalization technique for your data?

Shlomi Schwartz asked Sep 26 '18 08:09





1 Answer

There is no silver bullet, but some principles apply:

  1. The reason for normalization is so that no feature overly dominates the gradient of the loss function. Some algorithms are better at dealing with unnormalized features than others, I think, but in general, if your features have vastly different scales, you could get into trouble. So normalizing to the range 0 to 1 is sensible.
  2. You want to maximize the entropy of your features, to help the algorithm separate the examples. You achieve this by spreading the values as much as possible over the given range (0 to 1). Sometimes it can be valuable to scale some parts of the feature space differently from others. For example, if there are ten versions, but six are essentially the same and the other four differ greatly from one another, then it might make sense to scale such that the first six versions are close together and the rest are more spread out.
  3. Point 2 means that the scaling is now part of your training and of your trained algorithm, so keep that in mind! If you are doing cross-validation, fit the scaling on the training folds only and then apply it to the held-out fold. Otherwise, information from the test data leaks into training.
  4. Some algorithms expect categorical or count features rather than continuous values (some Naive Bayes variants come to mind, although Gaussian Naive Bayes does handle continuous features). Make sure you know what your chosen algorithm can work with.
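
Points 1 and 3 could be sketched with scikit-learn roughly as follows. This is only an illustration under assumptions: the data is synthetic, the column layout (three one-hot columns, then a time column, then a version column), the target, and the choice of LogisticRegression are all made up for the example and are not taken from the question.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the data in the question (all values are made up).
rng = np.random.default_rng(0)
n = 200
one_hot = np.eye(3)[rng.integers(0, 3, size=n)]            # categories a-c, already 0/1
days_since = rng.uniform(0, 365, size=(n, 1))              # time since an event
version = rng.integers(1, 11, size=(n, 1)).astype(float)   # release version 1-10
X = np.hstack([one_hot, days_since, version])
y = (days_since[:, 0] > 180).astype(int)                   # hypothetical binary target

# Scale only the continuous columns (indices 3 and 4).
# The one-hot columns are passed through unchanged.
preprocess = ColumnTransformer(
    [("scale", MinMaxScaler(), [3, 4])],
    remainder="passthrough",
)

# Because the scaler lives inside the pipeline, every cross-validation split
# fits it on the training folds only and then applies it to the held-out fold.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
scores = cross_val_score(model, X, y, cv=5)

# Fitting on the full data here is only to inspect the transformed values.
# Note that the scaled columns come first in the output, then the passthrough columns.
X_scaled = preprocess.fit_transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
print(scores)
```

Calling the scaler on the whole dataset before splitting would be the leak described in point 3; keeping it inside the pipeline avoids that without any manual per-fold bookkeeping.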
kutschkem answered Sep 28 '22 04:09
