
Standardization before or after categorical encoding?

I'm working on a regression algorithm, in this case k-Nearest Neighbors, to predict the price of a product.

So I have a training set with only one categorical feature, which has 4 possible values. I've dealt with it using a one-of-K (one-hot) categorical encoding scheme, which means I now have 3 more columns in my pandas DataFrame with a 0/1 depending on the value present.
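For context, here is roughly what that encoding step looks like with pandas (a minimal sketch; the column names and values are invented):

    import pandas as pd

    # Invented training data: one categorical feature with 4 levels
    # plus a couple of numeric features.
    df = pd.DataFrame({
        "category": ["A", "B", "C", "D", "A"],
        "distance_km": [1.2, 3.4, 0.5, 2.2, 4.1],
        "price": [100.0, 150.0, 80.0, 120.0, 95.0],
    })

    # One-of-K (one-hot) encoding; drop_first=True yields K-1 = 3 new
    # 0/1 columns, matching the "3 more columns" described above.
    encoded = pd.get_dummies(df, columns=["category"], drop_first=True)
    print(encoded.head())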

The other features in the DataFrame are mostly distances, like latitude/longitude for locations, and prices; all numerical.

Should I standardize (rescale to zero mean and unit variance) or normalize before or after the categorical encoding?

I'm thinking it might be beneficial to normalize after encoding, so that every feature counts equally to the estimator when measuring distances between neighbors, but I'm not really sure.
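Here is roughly what I mean by scaling after encoding, as a minimal sketch (the data and column names are invented, continuing the example above):

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsRegressor

    # Invented, already-encoded training data: a numeric feature plus
    # the three 0/1 dummy columns from the one-hot step.
    X = pd.DataFrame({
        "distance_km": [1.2, 3.4, 0.5, 2.2, 4.1],
        "category_B": [0, 1, 0, 0, 1],
        "category_C": [1, 0, 0, 1, 0],
        "category_D": [0, 0, 1, 0, 0],
    })
    y = [100.0, 150.0, 80.0, 120.0, 95.0]

    # Standardize every column (dummies included) after encoding, then
    # fit the distance-based regressor on the scaled features.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
    model.fit(X, y)
    print(model.predict(X.head(2)))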

Asked Nov 13 '17 by Franch



2 Answers

This seems like an open problem, so I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them.

I have tried the opposite with scikit-learn's preprocessing.StandardScaler(), and it doesn't work if your feature vectors do not all have the same length: scaler.fit(X_train) raises ValueError: setting an array element with a sequence. I can see from your description that your data have a fixed number of features, but for generalization purposes (maybe you will have new features in the future?) it's good to assume that each data instance may have a different feature-vector length. For instance, I transform my text documents into word indices with Keras's text_to_word_sequence (which gives me vectors of different lengths), then convert them to one-hot vectors, and then standardize them. I have actually not seen a big improvement from the standardization.

You should also reconsider which of your features to standardize, as dummies might not need to be standardized; here it doesn't seem like the categorical attributes need any standardization or normalization. k-nearest neighbors is distance-based, so it can be affected by these preprocessing techniques. I would suggest trying both standardization and normalization and checking how different models react with your dataset and task.
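To illustrate that last point, here is a minimal sketch that standardizes only the numeric column and passes the already-encoded 0/1 dummies through untouched (the data and column names are invented; scikit-learn's ColumnTransformer does the selective scaling):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsRegressor

    # Invented data: the dummies are already 0/1, the numeric column is not.
    X = pd.DataFrame({
        "distance_km": [1.2, 3.4, 0.5, 2.2, 4.1],
        "category_B": [0, 1, 0, 0, 1],
        "category_C": [1, 0, 0, 1, 0],
        "category_D": [0, 0, 1, 0, 0],
    })
    y = [100.0, 150.0, 80.0, 120.0, 95.0]

    # Standardize only the numeric feature; leave the dummies unchanged.
    preprocess = ColumnTransformer(
        [("scale", StandardScaler(), ["distance_km"])],
        remainder="passthrough",
    )

    knn = make_pipeline(preprocess, KNeighborsRegressor(n_neighbors=3))
    knn.fit(X, y)
    print(knn.predict(X.head(2)))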

Answered Oct 16 '22 by KLaz


After. Just imagine that your column contained not numerical variables but strings. You can't standardize strings, right? :)

But given what you wrote about the categories: if they can be represented with values, I suppose there is some kind of ranking inside. You could probably use the raw (ordinal) column rather than one-hot encoding it. Just a thought.
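A minimal sketch of that ordinal alternative, assuming the four levels really do have a natural order (the level names and their ranking are invented):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # Invented example: suppose the four category levels are ordered sizes.
    df = pd.DataFrame({"size": ["small", "medium", "large", "x-large", "small"]})

    # Encode the categorical column as one ordered numeric column
    # instead of three one-hot dummies.
    encoder = OrdinalEncoder(categories=[["small", "medium", "large", "x-large"]])
    df["size_rank"] = encoder.fit_transform(df[["size"]]).ravel()
    print(df)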

Answered Oct 16 '22 by avchauzov