I am working on a Kaggle dataset: https://www.kaggle.com/c/santander-customer-satisfaction. I understand that some sort of feature scaling is needed before PCA. I read from this post and this post that normalization is best, but it was standardization that gave me the highest performance (AUC-ROC).
I tried all the feature scaling methods from sklearn, including RobustScaler(), Normalizer(), MinMaxScaler(), MaxAbsScaler() and StandardScaler(). Then, using the scaled data, I ran PCA. But it turns out that the optimal number of principal components varies greatly between these methods.
Here's the code I use:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Fit PCA with as many components as there are features
pca = PCA(n_components=X_train_scaled.shape[1])
pca.fit(X_train_scaled)
cum_ratios = np.cumsum(pca.explained_variance_ratio_)

# Plot the cumulative explained variance
x = np.arange(X_train_scaled.shape[1])
plt.plot(x, cum_ratios, '-o')
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.title("Variance explained by the principal components")
plt.show()

# Find the smallest number of components explaining at least 99% of the variance
for i, cum_ratio in enumerate(cum_ratios):
    if cum_ratio >= 0.99:
        num_components = i + 1
        print("The optimal number of components is: {}".format(num_components))
        break
These are the different numbers of components I got using the different scalers.
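For reference, such a comparison can be sketched like this (a sketch only, not the exact code I ran; it assumes the same X_train as above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

scalers = {
    "RobustScaler": RobustScaler(),
    "Normalizer": Normalizer(),
    "MinMaxScaler": MinMaxScaler(),
    "MaxAbsScaler": MaxAbsScaler(),
    "StandardScaler": StandardScaler(),
}

for name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X_train)
    cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
    # first index at which the cumulative explained variance reaches 99%
    n_components = int(np.argmax(cum_var >= 0.99)) + 1
    print("{}: {} components needed for 99% of the variance".format(name, n_components))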
So, my question is, which method is the right one for feature scaling in this situation? Thanks!
Anyway, the correct answer should be: it depends. Typically a feature selection step comes after the PCA (with an optimization parameter describing the number of features to keep), and scaling comes before PCA. However, depending on the problem this may change. You might want to apply PCA only to a subset of features, as in the sketch below.
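For example, applying PCA only to some of the columns can be wired up with a ColumnTransformer. This is just a sketch: pca_cols is a made-up placeholder for whichever columns you decide to compress.

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder: the first 100 columns, purely for illustration.
pca_cols = list(range(100))

preprocess = ColumnTransformer(
    transformers=[
        ("scaled_pca", Pipeline([("scale", StandardScaler()),
                                 ("pca", PCA(n_components=0.99))]), pca_cols),
    ],
    remainder="passthrough",  # leave all other columns untouched
)
X_train_prepared = preprocess.fit_transform(X_train)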
Principal Component Analysis (PCA) is also a good example of when feature scaling is important: since we are interested in the components that maximize the variance, we need to ensure that we are comparing apples to apples.
The takeaway: before applying PCA, always check the variance of each feature in the dataset, and if there is a large gap between the variances, scale the data with an appropriate scaler. A quick way to check is sketched below.
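A minimal sketch of that check (assuming X_train is the training feature matrix):

import numpy as np

# Per-feature variances of the raw training data; a spread of several
# orders of magnitude is a strong hint that scaling is needed before PCA.
variances = np.var(X_train, axis=0)
print("smallest feature variance: {:.4g}".format(np.min(variances)))
print("largest feature variance:  {:.4g}".format(np.max(variances)))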
The rule of thumb is that if your data is already on the same scale (e.g. every feature is XX per 100 inhabitants), scaling it will remove the information contained in the fact that your features have unequal variances. If the data is on different scales, then you should normalize it before running PCA.
Data on which the PCA transformation is calculated should be normalized, meaning in this case that each feature is centered to zero mean and scaled to unit variance. This is basically sklearn's StandardScaler, which is the one I would prefer among your candidates. The reasons are explained on Wikipedia and also here.
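If you go with StandardScaler, one compact way to wire it up is a Pipeline (a sketch, not the only option); note that sklearn's PCA also accepts a variance fraction for n_components, which replaces the manual cumulative-sum loop:

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# PCA(n_components=0.99) keeps just enough components to explain
# 99% of the variance, so no manual search over the cumulative sum is needed.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.99)),
])
X_train_pca = pipeline.fit_transform(X_train)
print("Components kept:", pipeline.named_steps["pca"].n_components_)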