
Feature Scaling required or not

I am working with a sample data set to learn clustering. The data set contains the number of occurrences of each keyword.

Since every feature is an occurrence count for a different keyword, is it OK not to scale the values and to use them as they are?

I read a couple of articles on the internet which emphasize that scaling is important because it adjusts the relative weight of the frequencies. Since most of the frequencies are 0 (95%+), z-score scaling will change the shape of the distribution, which I feel could be a problem because I would be changing the nature of the data.

I am thinking of not changing the values at all to avoid this. Will that affect the quality of the results I get from clustering?

Yantraguru asked Apr 24 '15 08:04



1 Answer

As was already noted, the answer depends heavily on the algorithm being used.

If you're using a distance-based algorithm with the (usually default) Euclidean distance (for example, k-Means or k-NN), it will rely more on features with a bigger range, simply because a "typical difference" between values of such a feature is bigger.
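To make the "typical difference" point concrete, here is a minimal sketch using only the Python standard library. The four rows are made-up keyword counts (a rare keyword and a very common one), and the helper names `squared_diff_shares` and `standardize` are just illustrative. It measures how much each feature contributes to the squared Euclidean distance between two rows, before and after z-scoring.

```python
import statistics

def squared_diff_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squared)
    return [s / total for s in squared]

def standardize(rows):
    """Z-score every column (population standard deviation)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: [count of a rare keyword, count of a very common keyword]
docs = [[0, 500], [5, 560], [1, 520], [4, 540]]

raw_shares = squared_diff_shares(docs[0], docs[1])
scaled = standardize(docs)
scaled_shares = squared_diff_shares(scaled[0], scaled[1])

print("raw shares:   ", [round(s, 3) for s in raw_shares])
print("scaled shares:", [round(s, 3) for s in scaled_shares])
```

On this toy data the common keyword accounts for more than 99% of the raw squared distance, even though both keywords differ by their full observed range between the two rows. After standardizing, the two features contribute roughly equally.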

Non-distance-based models can be affected, too. One might think that linear models do not fall into this category: scaling (and translating, if needed) is a linear transformation, so if it makes the results better, the model should learn it, right? It turns out the answer is no. The reason is that almost no one uses vanilla linear models; in practice they are nearly always used with some sort of regularization that penalizes large weights. This can prevent your linear model from learning the scaling from the data.
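A small sketch of the regularization argument, assuming the simplest possible case: one feature, no intercept, and an L2 (ridge) penalty, for which the closed-form slope is `sum(x*y) / (sum(x*x) + lam)`. The data, the `scale` factor and the function names are made up for illustration. The same feature is expressed in two different units, and the predictions are compared with and without the penalty.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for the one-feature model y ~ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def predict(xs_train, ys_train, x_new, lam):
    return ridge_slope(xs_train, ys_train, lam) * x_new

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]           # y = 2x exactly

scale = 0.01                         # hypothetical change of units for the feature
xs_small = [x * scale for x in xs]

# Without regularization the weight absorbs the rescaling: same prediction.
ols_raw = predict(xs, ys, 5.0, lam=0.0)
ols_small = predict(xs_small, ys, 5.0 * scale, lam=0.0)

# With the same penalty strength the rescaled feature needs a weight 100x larger,
# which the penalty discourages, so the prediction changes.
ridge_raw = predict(xs, ys, 5.0, lam=1.0)
ridge_small = predict(xs_small, ys, 5.0 * scale, lam=1.0)

print(ols_raw, ols_small)
print(ridge_raw, ridge_small)
```

With `lam=0` both fits predict the same value for the new point. With `lam=1` the fit on the small-unit feature is shrunk almost to zero, which is the sense in which a penalized linear model cannot simply "learn" the scaling.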

There are also models that are independent of the feature scale. For example, tree-based algorithms (decision trees and random forests) are not affected. A node of a tree partitions your data into two sets by comparing a feature (the one that splits the dataset best) to a threshold value. There's no regularization on the threshold itself (regularization for trees instead keeps the height of the tree small), so it's not affected by different scales.
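The threshold argument can be checked with a toy decision stump. This is a sketch, not how any particular library implements trees: `best_split` is a hypothetical helper that tries the midpoints between consecutive distinct values and keeps the one with the fewest misclassified points. The values and labels are invented.

```python
def best_split(values, labels):
    """Return (threshold, left_indices) for the stump with the fewest errors.

    Each side predicts its majority label; labels are 0 or 1.
    """
    distinct = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    everyone = set(range(len(values)))
    best = None
    for t in candidates:
        left = frozenset(i for i, v in enumerate(values) if v <= t)
        errors = 0
        for side in (left, everyone - left):
            ones = sum(labels[i] for i in side)
            errors += min(ones, len(side) - ones)
        if best is None or errors < best[0]:
            best = (errors, t, left)
    return best[1], best[2]

counts = [0, 0, 1, 3, 7, 9]           # hypothetical keyword counts
labels = [0, 0, 0, 1, 1, 1]

t_raw, left_raw = best_split(counts, labels)
t_scaled, left_scaled = best_split([c * 1000 for c in counts], labels)

print(t_raw, sorted(left_raw))
print(t_scaled, sorted(left_scaled))
```

Multiplying the feature by 1000 moves the threshold by the same factor, but the rows that end up on each side of the split are identical, so the resulting tree makes the same predictions.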

That being said, it's usually advised to standardize (subtract mean and divide by standard deviation) your data.
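For reference, a minimal sketch of that standardization on a column shaped like the data in the question (95% zeros). The column is invented, and the guard for constant columns is an added assumption, since a keyword that never varies has a standard deviation of zero.

```python
import statistics

def zscore(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # Assumption: map a constant column to zeros instead of dividing by zero.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

counts = [0] * 19 + [20]              # hypothetical keyword column, 95% zeros
z = zscore(counts)

print(round(z[0], 4), round(z[-1], 4))
print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))
```

All the zeros map to the same slightly negative value and the single non-zero count stays the largest value, so the ordering of the rows and the relative spacing between them are unchanged. Only the location and the unit of the column change.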

Artem Sobolev answered Nov 01 '22 08:11
