
Why scale across rows, not columns, when standardizing (preprocessing) data before clustering?

I am very confused and could not find a convincing answer online to the following question about preprocessing data before clustering.

According to the scikit-learn documentation, when we preprocess with the library's built-in scaling command, and the data is arranged as an N x D matrix where rows are the samples and columns are the features, we make the mean across the rows zero and the standard deviation across the rows one, like the following:

X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

X_scaled.std(axis=0)
array([ 1.,  1.,  1.])

My question is: shouldn't we make the mean across the columns (features instead of samples) zero, and likewise the standard deviation one, since we are trying to standardize the features, not the samples? Websites and other resources always standardize across rows, but they never explain why.

justin asked Dec 13 '22 16:12



2 Answers

I would expect that you'd want to normalize the values for a given feature, across the samples. If you normalize a given sample's data across its features, you've tossed out a lot of information. That would be for comparing features (which rarely makes sense), rather than for comparing samples for a feature.

I don't know numpy or sklearn, so take this with a grain of salt. When normalizing, you want to normalize all the data for a given feature using the same parameters, so that the feature ends up on a common scale: here, a mean of zero and a standard deviation of one, as in the output shown in the question (other schemes instead map values into a fixed range such as -1 to +1). You'd do this separately for each feature, so they all end up on that scale, with each feature's mean at zero.
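As a minimal sketch of the point above, assuming NumPy and a made-up 4 x 3 matrix (the values are purely illustrative): `axis=0` collapses the row axis, so `mean(axis=0)` returns one number per column. In other words, the statistics in the question are per-feature statistics computed over the samples, which is what the questioner was asking for.

```python
import numpy as np

# Hypothetical data: N = 4 samples (rows), D = 3 features (columns).
X = np.array([
    [1.0, -1.0, 2.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, -1.0],
    [3.0, 2.0, 1.0],
])

# axis=0 collapses the row axis, leaving one value per column (feature).
feature_means = X.mean(axis=0)
feature_stds = X.std(axis=0)

# Standardize each feature using statistics computed over all samples.
X_scaled = (X - feature_means) / feature_stds

print(feature_means.shape)    # one entry per feature
print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```

The shape of `feature_means` matches the number of features, not the number of samples, which is probably the quickest way to check which direction a given `axis` argument works in.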

Consider what would happen if you normalized across all the features for a given sample.

        height weight age
person1 180    65     50
person2 140    45     50

If we normalize the values for person1 across the features, then do the same for person2, then person2 will seem to have a different age than person1!

If we normalize across the samples for a given column, then the relationships will still hold. Their ages will match; person1 will be taller, and person2 will weigh less. But every feature will now be on a comparable scale, which is what the subsequent analysis needs.
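The height/weight/age example can be checked numerically. The following is only a sketch, assuming NumPy; the `standardize` helper is hypothetical (not a library function) and leaves a constant column or row at zero rather than dividing by a zero standard deviation.

```python
import numpy as np

# Rows are person1 and person2; columns are height, weight, age.
X = np.array([
    [180.0, 65.0, 50.0],
    [140.0, 45.0, 50.0],
])

def standardize(data, axis):
    """Z-score along the given axis; constant slices are left at zero."""
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero when all values are equal
    return (data - mean) / std

per_feature = standardize(X, axis=0)  # statistics taken over samples
per_sample = standardize(X, axis=1)   # statistics taken over features

AGE = 2
print(per_feature[:, AGE])  # both people get the same standardized age
print(per_sample[:, AGE])   # the two ages no longer match
```

With per-feature scaling the two identical ages stay identical and person1 remains the taller one; with per-sample scaling the same age of 50 maps to two different values because each person's own mean and spread are different.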

Jeff Learman answered Jan 05 '23 00:01



There is a place for normalizing your samples. One example is when your features are counts. In this case, normalizing each sample to unit l1-norm effectively changes each feature to a percentage of the total count for that sample.

Sklearn's Normalizer is made for sample normalization and can normalize to l1 or l2 norm.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html
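To illustrate the count case, here is a small sketch in plain NumPy with made-up counts. It does not call sklearn's Normalizer; it just carries out the per-sample L1 scaling described above by hand, dividing each row by the sum of its absolute values.

```python
import numpy as np

# Hypothetical counts: rows are samples, columns are counted categories.
counts = np.array([
    [2.0, 3.0, 5.0],
    [10.0, 0.0, 30.0],
])

# L1 norm of each row; for non-negative counts this is just the row total.
row_totals = np.abs(counts).sum(axis=1, keepdims=True)

# Each entry becomes that feature's share of its sample's total.
fractions = counts / row_totals

print(fractions)
print(fractions.sum(axis=1))  # each row sums to 1
```

Note that the operation here runs along `axis=1` (within each sample), the opposite direction from feature standardization, which is why it suits data where only the proportions within a sample matter.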

Bert Kellerman answered Jan 05 '23 00:01
