I am working on a Kaggle dataset: https://www.kaggle.com/c/santander-customer-satisfaction. I understand that some sort of feature scaling is needed before PCA. I read from this post and this post that normalization is best, but it was standardization that gave me the highest performance (AUC-ROC).
I tried all the feature scaling methods from sklearn, including RobustScaler(), Normalizer(), MinMaxScaler(), MaxAbsScaler() and StandardScaler(). Then, using the scaled data, I ran PCA. But it turns out that the optimal number of principal components varies greatly between these methods.
Here's the code I use:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Fit PCA with as many components as there are features
pca = PCA(n_components=X_train_scaled.shape[1])
pca.fit(X_train_scaled)
cum_ratios = np.cumsum(pca.explained_variance_ratio_)

# Plot the cumulative explained variance
x = np.arange(X_train_scaled.shape[1])
plt.plot(x, cum_ratios, '-o')
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.title("Variance explained by the principal components")
plt.show()

# Find the smallest number of components explaining at least 99% of the variance
for i, cum_ratio in enumerate(cum_ratios):
    if cum_ratio >= 0.99:
        num_components = i + 1
        print("The optimal number of components is: {}".format(num_components))
        break
These are the different numbers of components I got using the different scalers.
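For reference, such a comparison can be sketched like this (a sketch only, not the exact code I ran; it assumes the same X_train as above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

scalers = {
    "RobustScaler": RobustScaler(),
    "Normalizer": Normalizer(),
    "MinMaxScaler": MinMaxScaler(),
    "MaxAbsScaler": MaxAbsScaler(),
    "StandardScaler": StandardScaler(),
}

for name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X_train)
    cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
    # first index at which the cumulative explained variance reaches 99%
    n_components = int(np.argmax(cum_var >= 0.99)) + 1
    print("{}: {} components needed for 99% of the variance".format(name, n_components))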
So, my question is, which method is the right one for feature scaling in this situation? Thanks!
Anyway, the correct answer should be: it depends. Typically a feature selection step comes after the PCA (with an optimization parameter describing the number of features to keep), and scaling comes before PCA. However, depending on the problem this may change. You might want to apply PCA only to a subset of features, as in the sketch below.
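For example, applying PCA only to some of the columns can be wired up with a ColumnTransformer. This is just a sketch: pca_cols is a made-up placeholder for whichever columns you decide to compress.

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder: the first 100 columns, purely for illustration.
pca_cols = list(range(100))

preprocess = ColumnTransformer(
    transformers=[
        ("scaled_pca", Pipeline([("scale", StandardScaler()),
                                 ("pca", PCA(n_components=0.99))]), pca_cols),
    ],
    remainder="passthrough",  # leave all other columns untouched
)
X_train_prepared = preprocess.fit_transform(X_train)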
Principal Component Analysis (PCA) is also a good example of when feature scaling is important: since we are interested in the components that maximize the variance, we need to ensure that we are comparing apples to apples.
The takeaway: before applying PCA, always check the variance of each feature in the dataset, and if there is a large gap between the variances, scale the data with an appropriate scaler. A quick way to check is sketched below.
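A minimal sketch of that check (assuming X_train is the training feature matrix):

import numpy as np

# Per-feature variances of the raw training data; a spread of several
# orders of magnitude is a strong hint that scaling is needed before PCA.
variances = np.var(X_train, axis=0)
print("smallest feature variance: {:.4g}".format(np.min(variances)))
print("largest feature variance:  {:.4g}".format(np.max(variances)))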
The rule of thumb is that if your data is already on the same scale (e.g. every feature is XX per 100 inhabitants), scaling it will remove the information contained in the fact that your features have unequal variances. If the data is on different scales, then you should normalize it before running PCA.
Data on which the PCA transformation is calculated should be normalized, meaning in this case that each feature is centered to zero mean and scaled to unit variance. This is basically sklearn's StandardScaler, which is the one I would prefer among your candidates. The reasons are explained on Wikipedia and also here.
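If you go with StandardScaler, one compact way to wire it up is a Pipeline (a sketch, not the only option); note that sklearn's PCA also accepts a variance fraction for n_components, which replaces the manual cumulative-sum loop:

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# PCA(n_components=0.99) keeps just enough components to explain
# 99% of the variance, so no manual search over the cumulative sum is needed.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.99)),
])
X_train_pca = pipeline.fit_transform(X_train)
print("Components kept:", pipeline.named_steps["pca"].n_components_)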